{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":235860204,"defaultBranch":"master","name":"DeepSpeed","ownerLogin":"microsoft","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2020-01-23T18:35:18.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/6154722?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1718140097.0","currentOid":""},"activityList":{"items":[{"before":"542d2a2ea78d296dd13024a2907b09e6c136f471","after":"230615bd402150ad6dbe68e53b96ac9ffa090122","ref":"refs/heads/adk9/phi3-small","pushedAt":"2024-06-11T23:36:47.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"adk9","name":"Abhishek Kulkarni","path":"/adk9","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11399?s=80&v=4"},"commit":{"message":"Add mup_embedding_multiplier","shortMessageHtmlLink":"Add mup_embedding_multiplier"}},{"before":"dd4868fea01900257a8ac115b132764a1807f42e","after":"702bad7ef0355da63700dcd2e96f20adfbbbaaee","ref":"refs/heads/adk9/phi3-inference","pushedAt":"2024-06-11T23:35:15.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"adk9","name":"Abhishek Kulkarni","path":"/adk9","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11399?s=80&v=4"},"commit":{"message":"Phi-3 mini has no unmapped param","shortMessageHtmlLink":"Phi-3 mini has no unmapped param"}},{"before":"e1accbb30ff913b35258d86aaa8e3adf0d77534e","after":"542d2a2ea78d296dd13024a2907b09e6c136f471","ref":"refs/heads/adk9/phi3-small","pushedAt":"2024-06-11T23:00:15.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"adk9","name":"Abhishek Kulkarni","path":"/adk9","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11399?s=80&v=4"},"commit":{"message":"Move input-layernorm inside decoder layer","shortMessageHtmlLink":"Move input-layernorm inside decoder layer"}},{"before":"cb5714ef1d0f35449f1a2eed92efca1ec1a4f637","after":"dd4868fea01900257a8ac115b132764a1807f42e","ref":"refs/heads/adk9/phi3-inference","pushedAt":"2024-06-11T22:59:28.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"adk9","name":"Abhishek Kulkarni","path":"/adk9","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11399?s=80&v=4"},"commit":{"message":"Fix formatting","shortMessageHtmlLink":"Fix formatting"}},{"before":"702bad7ef0355da63700dcd2e96f20adfbbbaaee","after":"cb5714ef1d0f35449f1a2eed92efca1ec1a4f637","ref":"refs/heads/adk9/phi3-inference","pushedAt":"2024-06-11T22:56:44.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"adk9","name":"Abhishek Kulkarni","path":"/adk9","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11399?s=80&v=4"},"commit":{"message":"Move input-layernorm inside the decoder layer","shortMessageHtmlLink":"Move input-layernorm inside the decoder layer"}},{"before":"b6e24adb43257628592aaaa772c328efac30f797","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5613-a41729f6a5b27b9df324b735bd3b16f387414272","pushedAt":"2024-06-11T22:17:36.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"a41729f6a5b27b9df324b735bd3b16f387414272","after":"b6e24adb43257628592aaaa772c328efac30f797","ref":"refs/heads/master","pushedAt":"2024-06-11T22:17:35.000Z","pushType":"merge_queue_merge","commitsCount":1,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"fixes in _partition_param_sec function (#5613)\n\nThere are few fixes:\n- When param.ds_secondary_tensor is not None and the param has not been\nupdated we don't need to update the param.ds_secondary_tensor.\n- In HPU the 2nd tensor partition will always be completed before the\nall-gather, so we don't need to add synchronize().","shortMessageHtmlLink":"fixes in _partition_param_sec function (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2333551305\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5613\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5613/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5613\">#5613</a>)"}},{"before":null,"after":"b6e24adb43257628592aaaa772c328efac30f797","ref":"refs/heads/gh-readonly-queue/master/pr-5613-a41729f6a5b27b9df324b735bd3b16f387414272","pushedAt":"2024-06-11T21:08:17.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"fixes in _partition_param_sec function (#5613)\n\nThere are few fixes:\n- When param.ds_secondary_tensor is not None and the param has not been\nupdated we don't need to update the param.ds_secondary_tensor.\n- In HPU the 2nd tensor partition will always be completed before the\nall-gather, so we don't need to add synchronize().","shortMessageHtmlLink":"fixes in _partition_param_sec function (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2333551305\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5613\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5613/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5613\">#5613</a>)"}},{"before":"4deb40de67728dec2606327e8d63d3cb7028cc94","after":"93c374f2b8c9815f6a39db3508c04da1ee049201","ref":"refs/heads/umchand/test_compiler","pushedAt":"2024-06-10T23:41:52.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"umchand","name":null,"path":"/umchand","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/111922840?s=80&v=4"},"commit":{"message":"Compiler test using Bert model","shortMessageHtmlLink":"Compiler test using Bert model"}},{"before":"23fc25a654bea170db46bb505820a05d0f8ffc24","after":"ddab18894000790b27602ebfceab884cb6a95333","ref":"refs/heads/duli/cuda_op_builder","pushedAt":"2024-06-10T22:18:48.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"duli2012","name":"Du Li","path":"/duli2012","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2879800?s=80&v=4"},"commit":{"message":"modify a few imports","shortMessageHtmlLink":"modify a few imports"}},{"before":"d72db03ce326d8108d959f5ca76c7441fb9c40b1","after":"b97d514b2c3a71fac1a9dfa4cfe99019c4c4d8fa","ref":"refs/heads/mrwyattii/pydantic-2-support","pushedAt":"2024-06-10T17:56:33.000Z","pushType":"push","commitsCount":6,"pusher":{"login":"adk9","name":"Abhishek Kulkarni","path":"/adk9","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11399?s=80&v=4"},"commit":{"message":"Merge branch 'master' into mrwyattii/pydantic-2-support","shortMessageHtmlLink":"Merge branch 'master' into mrwyattii/pydantic-2-support"}},{"before":"a41729f6a5b27b9df324b735bd3b16f387414272","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5606-1ef9b029cd6e9f64fa46956ea278bf60eec9dd51","pushedAt":"2024-06-10T13:56:02.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"1ef9b029cd6e9f64fa46956ea278bf60eec9dd51","after":"a41729f6a5b27b9df324b735bd3b16f387414272","ref":"refs/heads/master","pushedAt":"2024-06-10T13:56:01.000Z","pushType":"merge_queue_merge","commitsCount":1,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"Fix overlap communication of ZeRO stage 1 and 2 (#5606)\n\n`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`\nonly sets reduction stream waiting for default stream. This is ok in\ncases where the computation time is longer than the communication time,\nbut when the communication time is longer, it may result in a rewrite of\nthe ipg_buffer when the communication is not completed.\n\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)\n\n\n\nTo fix this bug, the easiest way is just add default stream to wait for\nreduction stream at the **same point**. For example, in point 1, the\n`reduction stream` needs to wait for '2', so we add a wait_stream to\n`reduction stream` waiting for `default stream`. Also, the `default\nstream` needs to wait for 'A', so we need to add a wait_stream to\n`default stream` waiting for `reduction stream` before the 'B'.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)\n\n\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5523, wait_stream does not\ncause host synchronization.\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5545, the modification is\nmore simple and the logic is the same, just waiting for what needs to\nwait.\n\n---\n\nWith this modification, losses of Qwen-1.5 with and without overlap_comm\nare totally identical.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)\n\n---\n\nOn the contrary, there is an obvious gap with a small sequence length,\nwhich means a short computation time.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)\n\nCo-authored-by: gp513 <guopeng34@huawei.com>\nCo-authored-by: CurryRice233 <nmeia@qq.com>\nCo-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>\nCo-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>","shortMessageHtmlLink":"Fix overlap communication of ZeRO stage 1 and 2 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2330623800\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5606\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5606/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5606\">#5606</a>)"}},{"before":"1ef9b029cd6e9f64fa46956ea278bf60eec9dd51","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5632-6e2899fbc6d9367615e6eb35e46b07c3e33e8651","pushedAt":"2024-06-10T12:26:56.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"6e2899fbc6d9367615e6eb35e46b07c3e33e8651","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5590-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-10T12:26:56.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"31a57fa392aea72481e082bd2f11d8cd4e6d8efe","after":"1ef9b029cd6e9f64fa46956ea278bf60eec9dd51","ref":"refs/heads/master","pushedAt":"2024-06-10T12:26:55.000Z","pushType":"merge_queue_merge","commitsCount":2,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"stage_1_and_2: optimize clip calculation to use clamp (#5632)\n\ninstead of \"if\" that causes host/device synchronization and introduces a\nbubble, while clamp is hapenning on the device","shortMessageHtmlLink":"stage_1_and_2: optimize clip calculation to use clamp (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2342131258\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5632\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5632/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5632\">#5632</a>)"}},{"before":null,"after":"a41729f6a5b27b9df324b735bd3b16f387414272","ref":"refs/heads/gh-readonly-queue/master/pr-5606-1ef9b029cd6e9f64fa46956ea278bf60eec9dd51","pushedAt":"2024-06-10T10:41:58.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"Fix overlap communication of ZeRO stage 1 and 2 (#5606)\n\n`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`\nonly sets reduction stream waiting for default stream. This is ok in\ncases where the computation time is longer than the communication time,\nbut when the communication time is longer, it may result in a rewrite of\nthe ipg_buffer when the communication is not completed.\n\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)\n\n\n\nTo fix this bug, the easiest way is just add default stream to wait for\nreduction stream at the **same point**. For example, in point 1, the\n`reduction stream` needs to wait for '2', so we add a wait_stream to\n`reduction stream` waiting for `default stream`. Also, the `default\nstream` needs to wait for 'A', so we need to add a wait_stream to\n`default stream` waiting for `reduction stream` before the 'B'.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)\n\n\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5523, wait_stream does not\ncause host synchronization.\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5545, the modification is\nmore simple and the logic is the same, just waiting for what needs to\nwait.\n\n---\n\nWith this modification, losses of Qwen-1.5 with and without overlap_comm\nare totally identical.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)\n\n---\n\nOn the contrary, there is an obvious gap with a small sequence length,\nwhich means a short computation time.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)\n\nCo-authored-by: gp513 <guopeng34@huawei.com>\nCo-authored-by: CurryRice233 <nmeia@qq.com>\nCo-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>\nCo-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>","shortMessageHtmlLink":"Fix overlap communication of ZeRO stage 1 and 2 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2330623800\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5606\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5606/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5606\">#5606</a>)"}},{"before":null,"after":"1ef9b029cd6e9f64fa46956ea278bf60eec9dd51","ref":"refs/heads/gh-readonly-queue/master/pr-5632-6e2899fbc6d9367615e6eb35e46b07c3e33e8651","pushedAt":"2024-06-10T10:39:41.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"stage_1_and_2: optimize clip calculation to use clamp (#5632)\n\ninstead of \"if\" that causes host/device synchronization and introduces a\nbubble, while clamp is hapenning on the device","shortMessageHtmlLink":"stage_1_and_2: optimize clip calculation to use clamp (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2342131258\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5632\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5632/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5632\">#5632</a>)"}},{"before":null,"after":"6e2899fbc6d9367615e6eb35e46b07c3e33e8651","ref":"refs/heads/gh-readonly-queue/master/pr-5590-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-10T10:39:16.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo (#5590)\n\nWe have been encountered an accuracy issue when running Torch compile +\nzero3 + activation checkpointing. Specifically some grads gets is zeroed\n(running without torch compile, this issue is not encountered). This\nissue was also reproduced by Umesh Chand from the DS team. We found that\nin the Pytorch repo torch compile has been specifically disabled using\nthe label: @torch._disable_dynamo()\nreference to the WA in the Pytorch repo\n(https://github.com/pytorch/pytorch/blob/ec8b254ef49b4a057cf89c2ae64520fb7b423a3e/torch/utils/checkpoint.py#L324)\nthis indicates that there is some issue with torch compile and\ncheckpointing (not necessarily DS related).\n\ngiven that the checkpointing function in DeepSpeed is based on the\nPytorch function, We propose to adopt this WA to ensure correct behavior\n(it can be removed later if the underlying issue is fixed)\nNote: this shouldn't impact non-troch compile cases.\n\n---------\n\nCo-authored-by: Olatunji Ruwase <olruwase@microsoft.com>","shortMessageHtmlLink":"WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2325658211\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5590\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5590/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5590\">#…</a>"}},{"before":"ca2aa39994ebac15b843ea2b2f1253cfd846c21f","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5606-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-10T10:08:25.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":null,"after":"ca2aa39994ebac15b843ea2b2f1253cfd846c21f","ref":"refs/heads/gh-readonly-queue/master/pr-5606-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-10T04:51:41.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"Fix overlap communication of ZeRO stage 1 and 2 (#5606)\n\n`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`\nonly sets reduction stream waiting for default stream. This is ok in\ncases where the computation time is longer than the communication time,\nbut when the communication time is longer, it may result in a rewrite of\nthe ipg_buffer when the communication is not completed.\n\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)\n\n\n\nTo fix this bug, the easiest way is just add default stream to wait for\nreduction stream at the **same point**. For example, in point 1, the\n`reduction stream` needs to wait for '2', so we add a wait_stream to\n`reduction stream` waiting for `default stream`. Also, the `default\nstream` needs to wait for 'A', so we need to add a wait_stream to\n`default stream` waiting for `reduction stream` before the 'B'.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)\n\n\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5523, wait_stream does not\ncause host synchronization.\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5545, the modification is\nmore simple and the logic is the same, just waiting for what needs to\nwait.\n\n---\n\nWith this modification, losses of Qwen-1.5 with and without overlap_comm\nare totally identical.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)\n\n---\n\nOn the contrary, there is an obvious gap with a small sequence length,\nwhich means a short computation time.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)\n\nCo-authored-by: gp513 <guopeng34@huawei.com>\nCo-authored-by: CurryRice233 <nmeia@qq.com>\nCo-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>\nCo-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>","shortMessageHtmlLink":"Fix overlap communication of ZeRO stage 1 and 2 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2330623800\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5606\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5606/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5606\">#5606</a>)"}},{"before":"dc70c16799c2c9cf4f70e1b7a9f99c0fde74bc4b","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5606-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-09T23:05:30.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":null,"after":"dc70c16799c2c9cf4f70e1b7a9f99c0fde74bc4b","ref":"refs/heads/gh-readonly-queue/master/pr-5606-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-09T22:53:49.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"Fix overlap communication of ZeRO stage 1 and 2 (#5606)\n\n`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`\nonly sets reduction stream waiting for default stream. This is ok in\ncases where the computation time is longer than the communication time,\nbut when the communication time is longer, it may result in a rewrite of\nthe ipg_buffer when the communication is not completed.\n\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)\n\n\n\nTo fix this bug, the easiest way is just add default stream to wait for\nreduction stream at the **same point**. For example, in point 1, the\n`reduction stream` needs to wait for '2', so we add a wait_stream to\n`reduction stream` waiting for `default stream`. Also, the `default\nstream` needs to wait for 'A', so we need to add a wait_stream to\n`default stream` waiting for `reduction stream` before the 'B'.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)\n\n\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5523, wait_stream does not\ncause host synchronization.\n\nCompared with the modification of\nhttps://github.com/microsoft/DeepSpeed/issues/5545, the modification is\nmore simple and the logic is the same, just waiting for what needs to\nwait.\n\n---\n\nWith this modification, losses of Qwen-1.5 with and without overlap_comm\nare totally identical.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)\n\n---\n\nOn the contrary, there is an obvious gap with a small sequence length,\nwhich means a short computation time.\n\n\n![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)\n\nCo-authored-by: gp513 <guopeng34@huawei.com>\nCo-authored-by: CurryRice233 <nmeia@qq.com>\nCo-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>\nCo-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>","shortMessageHtmlLink":"Fix overlap communication of ZeRO stage 1 and 2 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2330623800\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5606\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5606/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5606\">#5606</a>)"}},{"before":"9fdff649e297a68ddfedb0a17f136812080ed000","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5590-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-09T22:45:10.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":null,"after":"9fdff649e297a68ddfedb0a17f136812080ed000","ref":"refs/heads/gh-readonly-queue/master/pr-5590-31a57fa392aea72481e082bd2f11d8cd4e6d8efe","pushedAt":"2024-06-09T22:34:06.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo (#5590)\n\nWe have been encountered an accuracy issue when running Torch compile +\nzero3 + activation checkpointing. Specifically some grads gets is zeroed\n(running without torch compile, this issue is not encountered). This\nissue was also reproduced by Umesh Chand from the DS team. We found that\nin the Pytorch repo torch compile has been specifically disabled using\nthe label: @torch._disable_dynamo()\nreference to the WA in the Pytorch repo\n(https://github.com/pytorch/pytorch/blob/ec8b254ef49b4a057cf89c2ae64520fb7b423a3e/torch/utils/checkpoint.py#L324)\nthis indicates that there is some issue with torch compile and\ncheckpointing (not necessarily DS related).\n\ngiven that the checkpointing function in DeepSpeed is based on the\nPytorch function, We propose to adopt this WA to ensure correct behavior\n(it can be removed later if the underlying issue is fixed)\nNote: this shouldn't impact non-troch compile cases.\n\n---------\n\nCo-authored-by: Olatunji Ruwase <olruwase@microsoft.com>","shortMessageHtmlLink":"WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2325658211\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5590\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5590/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5590\">#…</a>"}},{"before":"49a59204e5ca78b8941c870defa378cd64393d47","after":"23fc25a654bea170db46bb505820a05d0f8ffc24","ref":"refs/heads/duli/cuda_op_builder","pushedAt":"2024-06-07T23:09:24.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"duli2012","name":"Du Li","path":"/duli2012","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2879800?s=80&v=4"},"commit":{"message":"Ideally, we want to move cuda specific builder info from folder op_builder to folder cuda, but the current implementation is a bit hard to spit, so will need to revisit it.","shortMessageHtmlLink":"Ideally, we want to move cuda specific builder info from folder op_bu…"}},{"before":"3a5bd59477cc46571ebaf95bce359507e0801b87","after":null,"ref":"refs/heads/loadams/pin-transformers-mii-conversation-is-deprecated","pushedAt":"2024-06-07T22:20:53.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"loadams","name":"Logan Adams","path":"/loadams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/114770087?s=80&v=4"}},{"before":"69adeab6568c02a503d9de65cd8b9e18137b6456","after":"31a57fa392aea72481e082bd2f11d8cd4e6d8efe","ref":"refs/heads/master","pushedAt":"2024-06-07T22:20:49.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"loadams","name":"Logan Adams","path":"/loadams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/114770087?s=80&v=4"},"commit":{"message":"Pin transformers version for MII tests (#5629)\n\n      MII legacy tests use `from transformers import Conversation`\r\n[here](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/legacy/method_table.py#L8).\r\n\r\nConversation was removed from transformers\r\n[here](https://github.com/huggingface/transformers/pull/31165) so we pin\r\nto a version before that before unpinning.","shortMessageHtmlLink":"Pin transformers version for MII tests (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2341046968\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5629\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5629/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5629\">#5629</a>)"}},{"before":"6ae678e522953ee8c1b455f33215f62c7eb32a03","after":null,"ref":"refs/heads/gh-readonly-queue/master/pr-5629-69adeab6568c02a503d9de65cd8b9e18137b6456","pushedAt":"2024-06-07T22:13:27.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":null,"after":"6ae678e522953ee8c1b455f33215f62c7eb32a03","ref":"refs/heads/gh-readonly-queue/master/pr-5629-69adeab6568c02a503d9de65cd8b9e18137b6456","pushedAt":"2024-06-07T22:02:55.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"Pin transformers version for MII tests (#5629)\n\nMII legacy tests use `from transformers import Conversation`\n[here](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/legacy/method_table.py#L8).\n\nConversation was removed from transformers\n[here](https://github.com/huggingface/transformers/pull/31165) so we pin\nto a version before that before unpinning.","shortMessageHtmlLink":"Pin transformers version for MII tests (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2341046968\" data-permission-text=\"Title is private\" data-url=\"https://github.com/microsoft/DeepSpeed/issues/5629\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/microsoft/DeepSpeed/pull/5629/hovercard\" href=\"https://github.com/microsoft/DeepSpeed/pull/5629\">#5629</a>)"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEYr4p8wA","startCursor":null,"endCursor":null}},"title":"Activity · microsoft/DeepSpeed"}