Avoid calling torch.cuda.synchronize in precompile_config #126624
Labels
module: inductor
oncall: pt2
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🐛 Describe the bug
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/?comment_id=7231763463618160&reply_comment_id=7235527919908381
I was debugging a deadlock and I noticed one of our threads was deadlocked on this stack:
The deadlock doesn't technically have anything to do with this synchronize; the real problem is that this rank issued an all_to_all earlier and it has deadlocked. But why are we stuck here? Well, we've asked for a full synchronize, so of course we have to wait for all the comms to finish. This seems... bad for compile time? Like, if we've issued comms, there's no reason to wait for them to all finish before we can run some Triton tuning?!
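To make the distinction concrete: torch.cuda.synchronize() waits for every stream on the device, while CUDA events only wait for work queued ahead of them on the stream they were recorded on. Here is a rough sketch of timing a kernel launch that way. The helper name time_on_stream and its arguments are made up for illustration and are not what Inductor does today. It also assumes the pending comms are not queued on (or waited on by) the stream being timed; if they are, this would still block.

```python
try:
    import torch
except ImportError:  # let the sketch import on machines without PyTorch
    torch = None


def time_on_stream(fn, stream=None):
    """Return the time in ms that fn() took on `stream`, or None without CUDA.

    Hypothetical helper: it waits only on an event recorded on `stream`,
    instead of calling the device-wide torch.cuda.synchronize().
    """
    if torch is None or not torch.cuda.is_available():
        return None
    if stream is None:
        stream = torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.cuda.stream(stream):
        start.record(stream)
        fn()  # kernels launched here are enqueued on `stream`
        end.record(stream)
    # Blocks until `end` completes, i.e. until the work queued on `stream`
    # before it has finished. Work on unrelated streams is not waited for.
    end.synchronize()
    return start.elapsed_time(end)
```

Something like time_on_stream(lambda: kernel(*args), torch.cuda.Stream()) would then time the launch on a side stream, where kernel and args stand in for whatever is being tuned. Whether a side stream is actually safe for autotuning (input buffers have to be ready on it) is a separate question.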
Versions
main
cc @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire