
Fused CPU Adam performance #574

Open
msaroufim opened this issue Mar 28, 2024 · 14 comments

msaroufim commented Mar 28, 2024

Describe the issue

I'm trying to leverage a fast CPU Adam implementation, and I've found several ways of doing so that yield slightly different performance. One setting is downright confusing as well, so I'm opening this issue to discuss.

Repro is here

Results

  1. Existing Adam optimizer time using PyTorch eager: 3.4665 seconds
  2. Fused Adam optimizer time using optimizer_fusion: 3.2542 seconds
  3. Fused Adam optimizer time using ipex_adam_step: 3.2268 seconds
  4. Fused Adam optimizer time using ipex.optimize but only optimize the optimizer: 2.7120 seconds
  5. Fused Adam optimizer time using ipex.optimize but optimize both the model and the optimizer: 3.3123 seconds (this makes no sense to me)
  6. torch.compile optimizer time: 4.1160 seconds
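
For context, each datapoint times repeated optimizer steps with a harness along these lines (a hypothetical skeleton, not the linked repro; the model size and iteration count here are made up):

```python
import time
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in model, not the repro's
optimizer = torch.optim.Adam(model.parameters())
x = torch.randn(64, 4096)

start = time.perf_counter()
for _ in range(100):
    optimizer.zero_grad()
    model(x).sum().backward()
    optimizer.step()
print(f"Adam optimizer time: {time.perf_counter() - start:.4f} seconds")
```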

Experiments were performed on

(fresh) (base) ubuntu@ip-172-31-48-15:~/tinyoptimizer/cpu_optimizer/ipex$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8488C
Stepping:                        8
CPU MHz:                         2400.000
BogoMIPS:                        4800.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       768 KiB
L1i cache:                       512 KiB
L2 cache:                        32 MiB
L3 cache:                        105 MiB
NUMA node0 CPU(s):               0-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
xiguiw (Contributor) commented Mar 28, 2024

@msaroufim
What's your expected result?

msaroufim (Author) commented Mar 28, 2024

This is the main one that's throwing me off:

Fused Adam optimizer time using ipex.optimize but only optimize the optimizer: 2.7120 seconds
Fused Adam optimizer time using ipex.optimize but optimize both the model and the optimizer: 3.3123 seconds (this makes no sense to me)

To repro, replace _ with model in this line: https://github.com/msaroufim/tinyoptimizer/blob/master/cpu_optimizer/ipex/class.py#L100

I'd like to understand the ballpark performance improvement I can expect from fused CPU Adam: is it around 10% or closer to 2x for my microbenchmark, and should I expect this pattern to change at larger model sizes?
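
For concreteness, the two variants differ only in whether the returned optimized model is kept (paraphrasing the linked script, one variant per run; model here is the repro's module):

```python
import torch
import intel_extension_for_pytorch as ipex

# Datapoint 4: discard the returned model, keep only the fused optimizer;
# the original eager model keeps running forward/backward.
_, optimizer = ipex.optimize(model, optimizer=torch.optim.Adam(model.parameters()))

# Datapoint 5: adopt the optimized model as well.
model, optimizer = ipex.optimize(model, optimizer=torch.optim.Adam(model.parameters()))
```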

jgong5 (Contributor) commented Mar 28, 2024

Fused Adam optimizer time using ipex.optimize but only optimize the optimizer: 2.7120 seconds
Fused Adam optimizer time using ipex.optimize but optimize both the model and the optimizer: 3.3123 seconds (this makes no sense to me)

I don't think this is expected; my guess is that something else is going on here. Do you have profiler info? Perhaps we can look into the problem with it.

I'd like to understand what's the ballpark performance improvement I can expect from using fused CPU ADAM is it around 10% or closer to 2x for my microbenchmark and should I expect this pattern to change at larger model sizes

If we're talking about the Adam optimizer alone, 2x makes more sense to me with the fused one, but it depends on model size: the larger the model, the more benefit we get from fusion.
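
To make that concrete, here is a minimal sketch (an illustration, not the IPEX kernel; bias correction is omitted) of a per-parameter eager Adam step versus a horizontally fused multi-tensor step:

```python
import torch

def eager_adam_step(params, grads, exp_avgs, exp_avg_sqs,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Several small ops dispatched per parameter tensor.
    for p, g, m, v in zip(params, grads, exp_avgs, exp_avg_sqs):
        m.mul_(beta1).add_(g, alpha=1 - beta1)         # first moment
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second moment
        p.addcdiv_(m, v.sqrt().add_(eps), value=-lr)   # parameter update

def fused_adam_step(params, grads, exp_avgs, exp_avg_sqs,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Each stage runs once over the whole tensor list, amortizing
    # dispatch overhead across all parameters.
    torch._foreach_mul_(exp_avgs, beta1)
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta1)
    torch._foreach_mul_(exp_avg_sqs, beta2)
    torch._foreach_addcmul_(exp_avg_sqs, grads, grads, value=1 - beta2)
    denom = torch._foreach_add(torch._foreach_sqrt(exp_avg_sqs), eps)
    torch._foreach_addcdiv_(params, exp_avgs, denom, value=-lr)
```

A fully fused kernel goes further still (the whole update in a single pass over each tensor), but the dispatch amortization above is the first-order effect.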

jgong5 (Contributor) commented Mar 28, 2024

cc @zhuhaozhe

msaroufim (Author) commented:

I don't have any profiling data available, but the results were reliably reproducing in the repro I linked in the original message. Let me know if there's any other info I can provide to make debugging this easier.

sanchitintel (Contributor) commented:

Hi @msaroufim,

Upon changing model, optimizer = ipex.optimize(model=model, optimizer=torch.optim.Adam(model.parameters())) to _, optimizer = ipex.optimize(model=model, optimizer=torch.optim.Adam(model.parameters())) at my end, I did see that the former was ~10% slower than the latter for the model you used (only one linear layer), but the difference wasn't as significant as what you encountered.

Nevertheless, we'll try to fix this regression. Thanks!

sanchitintel (Contributor) commented:

Investigating why datapoint 4 was faster than datapoints 2 or 3.

sanchitintel (Contributor) commented:

Hi @msaroufim, when ipex.optimize is used, _copy_model_and_optimizer is called if the model & optimizer can't be modified in place, which is the default case.

This method is responsible for the speedup when ipex.optimize is used with fused Adam (datapoint 4 in the description, not referring to FusedCPUAdam), as opposed to datapoints 2 or 3, where this method is not called.

I'll figure out what precisely in this method results in the speedup.

Thanks!
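
(Side note: per the default-copy behavior described above, ipex.optimize also accepts inplace=True, which mutates the passed-in model and optimizer instead of copying them. A sketch for ruling the copy in or out, assuming the flag behaves as documented:)

```python
import torch
import intel_extension_for_pytorch as ipex

# With inplace=True the deep copy in _copy_model_and_optimizer should be
# skipped, so timing this variant against the default isolates the copy.
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = ipex.optimize(model, optimizer=optimizer, inplace=True)
```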

sanchitintel (Contributor) commented:

Rather unintuitively, deep-copying the optimizer is what produces the ~10% speedup of datapoint 4 over 2/3. I verified this hypothesis by simply commenting out most of the code in _copy_model_and_optimizer.

https://github.com/intel/intel-extension-for-pytorch/blob/main/intel_extension_for_pytorch/frontend.py#L46

@jgong5 @zhuhaozhe, can you please elaborate on why that'd result in a speedup? Thanks!

sanchitintel (Contributor) commented Apr 2, 2024

@jgong5 @zhuhaozhe @Guobing-Chen, one remaining issue is datapoint 1 being faster than datapoint 6 (i.e., PyTorch eager mode being faster than torch.compile for the unfused Adam optimizer). This might also mean the eager-mode fused Adam optimizer will be faster than its torch.compile counterpart once the fused Adam optimizer is enabled in PyTorch.

sanchitintel (Contributor) commented Apr 3, 2024

Setting the OMP_NUM_THREADS & MKL_NUM_THREADS environment variables (or using torch.set_num_threads) reduces the gap between datapoints 1 & 6 but doesn't eliminate it.

I used something like this (in the lscpu output, cores 0-15 were on the same socket, i.e. I used only one of the two hardware threads per physical core):

OMP_NUM_THREADS=16 MKL_NUM_THREADS=16 numactl --membind=0 --cpunodebind=0 -C 0-15 python script_name.py

I had also preloaded Intel OpenMP (instead of GNU libgomp) & tcmalloc.
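
For reference, the preload looks something like this (library paths are illustrative; they depend on where Intel OpenMP and tcmalloc are installed):

LD_PRELOAD=/opt/conda/lib/libiomp5.so:/usr/lib/x86_64-linux-gnu/libtcmalloc.so OMP_NUM_THREADS=16 MKL_NUM_THREADS=16 numactl --membind=0 --cpunodebind=0 -C 0-15 python script_name.py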

Benchmarking results with torch.compile (datapoint 6): https://gist.github.com/sanchitintel/c2ccda7bdd58be9c12ecf16fa4680f25
Benchmarking results with eager mode (datapoint 1): https://gist.github.com/sanchitintel/8789298ee88b013c2bfb4b99b36e22ef

@jgong5, with torch.compile, the bottleneck seems to be the Torch-compiled region, despite using torch._inductor.config.cpp.enable_kernel_profile=True.

sanchitintel (Contributor) commented:

@msaroufim @jgong5,

There are graph breaks with torch.compile when an unfused optimizer is used; that's what's causing the overhead.
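
One way to see where the breaks occur (a sketch using the torch._dynamo.explain utility; train_step and its inputs stand in for the repro's, and the report's exact fields vary across PyTorch versions):

```python
import torch

def train_step(model, optimizer, x):
    optimizer.zero_grad()
    loss = model(x).sum()
    loss.backward()
    optimizer.step()

# Compiles train_step and reports graph count, graph-break count, and a
# reason for each break, instead of silently splitting into subgraphs.
explanation = torch._dynamo.explain(train_step)(model, optimizer, x)
print(explanation)
```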

sanchitintel (Contributor) commented Apr 3, 2024

Hi @msaroufim, these graph breaks come from the PyTorch source code itself. As per pytorch/pytorch#104053, they will be removed when solution 3 in that ticket (the entire graph is an inference graph) is implemented. Thanks!

@jgong5 @Guobing-Chen, Dynamo logs pertaining to the graph breaks are at https://gist.github.com/sanchitintel/05b19b6d162cf5cdf5dbb174c51962ec. They were collected with the environment variable TORCH_LOGS="+dynamo". Is a workaround possible? Otherwise, once the fused Adam optimizer is enabled in PyTorch, training with the eager-mode fused Adam optimizer may be faster than training with torch.compile.

Thanks!

ZhaoqiongZ added the CPU (CPU-specific issues) and Performance labels on Apr 24, 2024
yinghu5 assigned yinghu5 and xiguiw and unassigned yinghu5 on May 20, 2024