
[inductor] Fix edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer #126622

Closed · wants to merge 3 commits

Conversation


ColinPeppler (Contributor) commented May 18, 2024

Context

Here's an edge case that causes the JIT pass and the AOT pass to pick different fusions.

```py
# JIT -- buf3 is a MultiTemplateBuffer
V.graph.buffers = [buf0, buf1, buf2, buf3, buf4]
                                ^          ^
# JIT pass calls finalize_multi_template_buffers()
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]

# AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]
                                ^    ^
```

It happens like this:

  • JIT starts with the original set of nodes from V.graph.buffers.
  • In JIT, finalize_multi_template_buffers() is called, which can change the order of the buffers.
  • This makes the order of buffers/scheduler nodes differ between the JIT and AOT passes.
  • Now, each node's min/max order is different from before.
  • As a result, the proximity between two nodes is different (see score_fusion, from torch/_inductor/scheduler.py, below, and the worked example after it).
```py
def score_fusion(self, node1: BaseSchedulerNode, node2: BaseSchedulerNode):
    """
    Assign a score (higher comes first) to the fusion of node1
    and node2. When different fusions conflict with each other,
    this is the way we decide what order to run them in.
    Our current score is based on:
    - Estimate of the saved memory operations
    - Fusions closer together in original order
    """
    memory_score = self.score_fusion_memory(node1, node2)
    proximity_score = -max(
        abs(node1.min_order - node2.max_order),
        abs(node2.min_order - node1.max_order),
    )
    return (
        node1.is_template() == config.epilogue_fusion_first and memory_score > 0,
        node1.is_reduction() == node2.is_reduction() and memory_score > 0,
        memory_score,
        proximity_score,
    )
```
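
To make the divergence concrete, here is a minimal standalone sketch (hypothetical orders; not the scheduler's actual code) of how the reorder changes the buf2/buf4 proximity score:

```py
# Hypothetical illustration of the proximity term above.
# For an unfused node, min_order == max_order == its position in V.graph.buffers.
def proximity_score(n1_min, n1_max, n2_min, n2_max):
    return -max(abs(n1_min - n2_max), abs(n2_min - n1_max))

# JIT order [buf0, buf1, buf2, buf3, buf4]: buf2 at position 2, buf4 at position 4.
print(proximity_score(2, 2, 4, 4))  # -2

# After finalize_multi_template_buffers(), [buf0, buf1, buf2, buf4, buf3]:
# buf4 moves to position 3, adjacent to buf2, so the score improves.
print(proximity_score(2, 2, 3, 3))  # -1
```

Because the AOT pass recomputes orders over the reordered list, it can score the buf2/buf4 fusion higher than the JIT pass did, and the two passes diverge.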

Error

```
$ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion
======================================================================
FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune)
----------------------------------------------------------------------
Traceback (most recent call last):
  ...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn
    code, linemap = self.codegen_with_cpp_wrapper()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper
    return self.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen
    self.scheduler.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen
    self.get_backend(device).codegen_node(node)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node
    return self._triton_scheduling.codegen_node(node)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node
    return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule
    final_kernel.call_kernel(final_kernel.kernel_name)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel
    grid = wrapper.generate_default_grid(name, grid)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid
    params is not None
AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1'])
```

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang


pytorch-bot bot commented May 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126622

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit 0d381a5 with merge base a0429c0:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ColinPeppler added a commit that referenced this pull request May 18, 2024
…TemplateBuffer

ghstack-source-id: 20fd84f9d079593008434f19d6819ce0f24a4ea1
Pull Request resolved: #126622
Comment on lines 1757 to 1759
```py
# V.graph.buffers.remove(new_node)
# idx = V.graph.buffers.index(orig_node)
# V.graph.buffers[idx] = new_node
```
Contributor Author
The other option is to swap the old buffer with the new buffer, but I think sorting buffers by name at the start would be more deterministic.

Contributor Author

Never mind, going with this option; a sketch of what that looks like is below.
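
For reference, here's a minimal sketch of that swap, based on the commented-out lines above (orig_node and new_node are the names from the diff context; this is an illustration, not the exact patch):

```py
# Sketch of the swap approach from the commented-out diff lines above.
# Placing new_node in orig_node's original slot keeps the buffer order stable,
# so min/max orders (and hence proximity scores) agree between JIT and AOT.
def swap_in_place(buffers, orig_node, new_node):
    buffers.remove(new_node)        # drop the copy appended during finalization
    idx = buffers.index(orig_node)  # find the finalized template's old slot
    buffers[idx] = new_node         # reinsert the new buffer in place
```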

ColinPeppler added a commit that referenced this pull request May 18, 2024
…TemplateBuffer

ghstack-source-id: 0a0e85ca5c882591a3591bdb72a4e744340d0968
Pull Request resolved: #126622
Comment on lines 1279 to 1280
```py
sort_by_name = lambda n: n.get_name()
nodes = sorted(nodes, key=sort_by_name)
```
Contributor Author

I tried sorting nodes by name first, but ran into test errors:

inductor/test_torchinductor.py::CpuTests::test_buffer_batch_norm_cpu

chenyang78 (Contributor) left a comment

Please fix the lint issue. Otherwise, LGTM. Thanks!

ColinPeppler added a commit that referenced this pull request May 19, 2024
…TemplateBuffer

ghstack-source-id: 8e399722a3802747a92f78344b387160dd6d6161
Pull Request resolved: #126622
ColinPeppler (Contributor Author)

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label May 20, 2024
pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


ColinPeppler (Contributor Author)

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

eellison (Contributor)

Looks good. Where did this come up?

houseroad (Member) commented May 20, 2024

The issue caused failures in some internal models.

ColinPeppler (Contributor Author)

> Looks good. Where did this come up?

An internal model in production. Interestingly enough, the error came up after #125772.

Here's what I saw in the logs:

```
# JIT
[__fusion] fusing buf1109_buf1113_buf1123_buf1127 with buf1152_buf1156
[__fusion] cannot fuse buf1109_buf1113_buf1123_buf1127_buf1152_buf1156 with buf1182_buf1186: will increase peak memory
Generating code for node buf1109_buf1113_buf1123_buf1127_buf1152_buf1156 with estimated runtime 32.787698

# AOT
[__fusion] fusing buf1152_buf1156 with buf1182_buf1186
[__fusion] cannot fuse buf1109_buf1113_buf1123_buf1127 with buf1152_buf1156_buf1182_buf1186: will increase peak memory
```
