ncclCommWatchdog always terminates the process and prevents error handling if CUDA context is corrupted #126544
Labels: module: c10d, oncall: distributed, triaged
🐛 Describe the bug
`ncclCommWatchdog` uses `abort` to terminate the Python interpreter process if the CUDA context becomes corrupted while an NCCL collective is being executed. It does not respect the settings `TORCH_NCCL_ASYNC_ERROR_HANDLING=0` (NoHandling), `TORCH_NCCL_ASYNC_ERROR_HANDLING=2` (CleanUpOnly), or `TORCH_NCCL_ENABLE_MONITORING=0`. The watchdog always terminates the process and prevents any possible error handling (e.g. performing cleanup, logging the failure, or notifying other ranks that an error happened).
Repro:
Run on a machine with at least 2 GPUs:
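The repro script attached to the issue is not preserved in this text, so the sketch below is an assumption, not the original: it follows the scenario described above by corrupting the CUDA context on one rank (via a device-side assert from an out-of-bounds index) while an NCCL communicator is in use, so that the watchdog fires even though `TORCH_NCCL_ASYNC_ERROR_HANDLING=2` (CleanUpOnly) requests no process termination.

```python
# Hedged repro sketch (assumed, not the issue's original script).
# Requires a machine with >= 2 CUDA GPUs and an NCCL build of PyTorch.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(t)  # collective issued on the NCCL communicator
    if rank == 0:
        # Out-of-bounds gather triggers a device-side assert, which
        # corrupts the CUDA context on this rank.
        bad_idx = torch.tensor([1 << 30], device="cuda")
        _ = t[bad_idx]
    try:
        dist.all_reduce(t)  # subsequent collective now fails
        torch.cuda.synchronize()
    except Exception as exc:
        # The goal is to reach this handler and clean up / notify peers.
        # Per the issue, ncclCommWatchdog instead aborts the process
        # before this code can run, regardless of the env var settings.
        print(f"rank {rank} caught: {exc}")
    finally:
        dist.destroy_process_group()


if __name__ == "__main__" and torch.cuda.device_count() >= 2:
    # CleanUpOnly: error handling without process termination (requested).
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "2")
    mp.spawn(run, args=(2,), nprocs=2, join=True)
```

On a machine without two GPUs the script is a no-op; with two GPUs, the expectation under CleanUpOnly is that the `except` branch runs on the surviving rank, but the reported behavior is an unconditional abort.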
All possible combinations of `TORCH_NCCL_ASYNC_ERROR_HANDLING={0,1,2,3}` x `TORCH_NCCL_ENABLE_MONITORING={0,1}` also trigger the same failure. Traceback:
Versions
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k