
[inductor] Fix edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer #126622

Closed · wants to merge 3 commits

Conversation


ColinPeppler (Contributor) commented May 18, 2024

Context

Here's an edge case that causes the JIT pass and the AOT pass to pick different fusions.

```py
# JIT -- buf3 is a MultiTemplateBuffer
V.graph.buffers = [buf0, buf1, buf2, buf3, buf4]
                                ^          ^
# JIT pass calls finalize_multi_template_buffers()
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]

# AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]
                                ^    ^
```

It happens like this:

  • JIT starts with the original set of nodes from V.graph.buffers.
  • In JIT, finalize_multi_template_buffers() is called, which can change the order of the buffers.
  • This makes the order of buffers/scheduler nodes differ between the JIT and AOT passes.
  • Now, each node's min/max order is different from before.
  • As a result, the proximity between two nodes is different (see score_fusion, from torch/_inductor/scheduler.py, below, and the worked example after it).
```py
def score_fusion(self, node1: BaseSchedulerNode, node2: BaseSchedulerNode):
    """
    Assign a score (higher comes first) to the fusion of node1
    and node2. When different fusions conflict with each other,
    this is the way we decide what order to run them in.
    Our current score is based on:
    - Estimate of the saved memory operations
    - Fusions closer together in original order
    """
    memory_score = self.score_fusion_memory(node1, node2)
    proximity_score = -max(
        abs(node1.min_order - node2.max_order),
        abs(node2.min_order - node1.max_order),
    )
    return (
        node1.is_template() == config.epilogue_fusion_first and memory_score > 0,
        node1.is_reduction() == node2.is_reduction() and memory_score > 0,
        memory_score,
        proximity_score,
    )
```
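
To make the divergence concrete, here is a minimal standalone sketch (hypothetical orders; not the scheduler's actual code) of how the reorder changes the buf2/buf4 proximity score:

```py
# Hypothetical illustration of the proximity term above.
# For an unfused node, min_order == max_order == its position in V.graph.buffers.
def proximity_score(n1_min, n1_max, n2_min, n2_max):
    return -max(abs(n1_min - n2_max), abs(n2_min - n1_max))

# JIT order [buf0, buf1, buf2, buf3, buf4]: buf2 at position 2, buf4 at position 4.
print(proximity_score(2, 2, 4, 4))  # -2

# After finalize_multi_template_buffers(), [buf0, buf1, buf2, buf4, buf3]:
# buf4 moves to position 3, adjacent to buf2, so the score improves.
print(proximity_score(2, 2, 3, 3))  # -1
```

Because the AOT pass recomputes orders over the reordered list, it can score the buf2/buf4 fusion higher than the JIT pass did, and the two passes diverge.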

Error

```
$ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion
======================================================================
FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune)
----------------------------------------------------------------------
Traceback (most recent call last):
  ...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn
    code, linemap = self.codegen_with_cpp_wrapper()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper
    return self.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen
    self.scheduler.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen
    self.get_backend(device).codegen_node(node)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node
    return self._triton_scheduling.codegen_node(node)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node
    return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule
    final_kernel.call_kernel(final_kernel.kernel_name)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel
    grid = wrapper.generate_default_grid(name, grid)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid
    params is not None
AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1'])
```

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang


pytorch-bot bot commented May 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126622

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit 0d381a5 with merge base a0429c0:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ColinPeppler added a commit that referenced this pull request May 18, 2024
…TemplateBuffer

ghstack-source-id: 20fd84f9d079593008434f19d6819ce0f24a4ea1
Pull Request resolved: #126622
Comment on lines 1757 to 1759
```py
# V.graph.buffers.remove(new_node)
# idx = V.graph.buffers.index(orig_node)
# V.graph.buffers[idx] = new_node
```
Contributor Author
The other option is to swap the old buffer with the new buffer, but I think sorting buffers by name at the start would be more deterministic.

Contributor Author

Never mind, going with this option; a sketch of what that looks like is below.
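
For reference, here's a minimal sketch of that swap, based on the commented-out lines above (orig_node and new_node are the names from the diff context; this is an illustration, not the exact patch):

```py
# Sketch of the swap approach from the commented-out diff lines above.
# Placing new_node in orig_node's original slot keeps the buffer order stable,
# so min/max orders (and hence proximity scores) agree between JIT and AOT.
def swap_in_place(buffers, orig_node, new_node):
    buffers.remove(new_node)        # drop the copy appended during finalization
    idx = buffers.index(orig_node)  # find the finalized template's old slot
    buffers[idx] = new_node         # reinsert the new buffer in place
```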

ColinPeppler added a commit that referenced this pull request May 18, 2024
…TemplateBuffer

ghstack-source-id: 0a0e85ca5c882591a3591bdb72a4e744340d0968
Pull Request resolved: #126622
Comment on lines 1279 to 1280
```py
sort_by_name = lambda n: n.get_name()
nodes = sorted(nodes, key=sort_by_name)
```
Contributor Author

I tried sorting nodes by name first, but ran into test errors:

inductor/test_torchinductor.py::CpuTests::test_buffer_batch_norm_cpu

chenyang78 (Contributor) left a comment

Please fix the lint issue. Otherwise, LGTM. Thanks!

ColinPeppler added a commit that referenced this pull request May 19, 2024
…TemplateBuffer

ghstack-source-id: 8e399722a3802747a92f78344b387160dd6d6161
Pull Request resolved: #126622
ColinPeppler (Contributor Author)

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label May 20, 2024
pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


ColinPeppler (Contributor Author)

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

eellison (Contributor)

Looks good. Where did this come up?

houseroad (Member) commented May 20, 2024

The issue caused failures in some internal models.

ColinPeppler (Contributor Author)

> Looks good. Where did this come up?

An internal model in production. Interestingly enough, the error came up after #125772.

Here's what I saw in the logs:

```
# JIT
[__fusion] fusing buf1109_buf1113_buf1123_buf1127 with buf1152_buf1156
[__fusion] cannot fuse buf1109_buf1113_buf1123_buf1127_buf1152_buf1156 with buf1182_buf1186: will increase peak memory
Generating code for node buf1109_buf1113_buf1123_buf1127_buf1152_buf1156 with estimated runtime 32.787698

# AOT
[__fusion] fusing buf1152_buf1156 with buf1182_buf1186
[__fusion] cannot fuse buf1109_buf1113_buf1123_buf1127 with buf1152_buf1156_buf1182_buf1186: will increase peak memory
```
