Avoid calling torch.cuda.synchronize in precompile_config #126624

ezyang · 2024-05-18T21:12:59Z

🐛 Describe the bug

Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/?comment_id=7231763463618160&reply_comment_id=7235527919908381

I was debugging a deadlock and I noticed one of our threads was deadlocked on this stack:

[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/cuda/__init__.py", line 803 in synchronize
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_inductor/runtime/triton_heuristics.py", line 422 in _precompile_config
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_inductor/runtime/triton_heuristics.py", line 231 in precompile
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_inductor/codecache.py", line 3087 in triton
[trainer1|1]:  File "/tmp/torchinductor_shuaiyang/u6/cu6smvlhusvxdug2bu7lrz3zofdzwkt27gnjniqk2ti6u5jgvm4h.py", line 33 in <module>
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_inductor/runtime/compile_tasks.py", line 44 in _reload_python_module
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_inductor/codecache.py", line 2576 in load_by_key_path
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_inductor/graph.py", line 1680 in compile_to_module
[trainer1|1]:  File "/mnt/xarfuse/uid-236622/e595b196-seed-nspid4026531836_cgpid37271928-ns-4026531841/torch/_dynamo/utils.py", line 273 in time_wrapper

The deadlock doesn't technically have anything to do with this synchronize; the real problem is that this rank issued an all_to_all earlier and it has deadlocked. But why are we stuck here? Well, we've asked for a full synchronize, so of course we have to wait for all the comms to finish. This seems... bad for compile time? Like, if we've issued comms, there's no reason to wait for them to all finish before we can run some Triton tuning?!
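To make the scoping problem concrete, here is a toy model in pure Python (no CUDA involved) of why a device-wide synchronize gets stuck behind an unrelated hung collective while a per-stream wait does not. `ToyStream` and `device_synchronize` are hypothetical names invented for this sketch; they only mimic the waiting semantics and are not PyTorch APIs.

```python
import threading
import time


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: idle once its event is set."""

    def __init__(self, name):
        self.name = name
        self._done = threading.Event()
        self._done.set()  # an empty stream is trivially idle

    def enqueue_blocked_work(self):
        # Models a collective (e.g. all_to_all) that never completes
        # because a peer rank never joins it.
        self._done.clear()

    def enqueue_quick_work(self, seconds=0.01):
        # Models a short kernel launch that finishes on its own.
        self._done.clear()
        threading.Timer(seconds, self._done.set).start()

    def synchronize(self, timeout):
        # Returns True if the stream drained within `timeout` seconds.
        return self._done.wait(timeout)


def device_synchronize(streams, timeout):
    # Models a device-wide synchronize: it must wait for *every* stream.
    deadline = time.monotonic() + timeout
    return all(
        s.synchronize(max(0.0, deadline - time.monotonic())) for s in streams
    )


comm = ToyStream("comm")
tuning = ToyStream("tuning")
comm.enqueue_blocked_work()   # the hung all_to_all
tuning.enqueue_quick_work()   # the Triton tuning work we actually care about

# Waiting only on the tuning stream returns promptly.
stream_only = tuning.synchronize(timeout=1.0)
# Waiting on the whole device times out behind the hung collective.
device_wide = device_synchronize([comm, tuning], timeout=0.2)
print(stream_only, device_wide)
```

In real PyTorch, the narrower waits would be something like `torch.cuda.current_stream().synchronize()` or `torch.cuda.Event.synchronize()`. Whether either is a safe replacement inside `_precompile_config` is an assumption that would need checking against what the existing synchronize is guarding.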

Versions

main

cc @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

ezyang added the oncall: pt2 label May 18, 2024

xmfan added the module: inductor and triaged labels May 20, 2024
