Optimize Vectorized<float> exp() with neon simd instructions #126612

helloguo · 2024-05-18T06:53:18Z

Optimize Vectorized<float> exp() with NEON SIMD instructions. The implementation is copied, with minor changes, from https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/v_expf.c.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot · 2024-05-18T06:53:22Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126612

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c28c26c with merge base 6bb9d60:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-py3.8-clang10-onnx / test (default, 2, 2, linux.2xlarge) (gh) (trunk failure)
onnx/dynamo/test_dynamo_with_onnxruntime_backend.py::TestDynamoWithONNXRuntime::test_llama_attention_with_local_backend_0

This comment was automatically generated by Dr. CI and updates every 15 minutes.

WenleiHe · 2024-05-20T07:07:55Z

Any measurement comparing against sleef (not sure how it's implemented there for arm)?

helloguo · 2024-05-20T15:26:33Z

> Any measurement comparing against sleef (not sure how it's implemented there for arm)?

Tested on a MacBook Air M2 with the Llama-2-7b model using AOT Inductor: the default expf from libsystem_m.dylib gives 0.78 tokens/sec, Sleef gives 0.96 tokens/sec, and this PR gives 1.57 tokens/sec.

metascroy · 2024-05-29T18:28:21Z

@helloguo do you have any testing on the accuracy of this method?

I ask because I think Sleef guarantees a certain accuracy with their polynomial approximations, and so this might not be comparing apples-to-apples.

I see that Sleef does offer the ability to lower the accuracy for more speed if desired, but I haven't played around with it (see section "ULP, gradual underflow and flush-to-zero mode" in https://sleef.org/additional.xhtml).

Optimize exp with neon simd instructions

06dcc29

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 18, 2024

helloguo requested review from digantdesai and malfet May 18, 2024 06:57

modify preprocessor macros

c28c26c

desertfire added the ciflow/linux-aarch64 linux aarch64 CI workflow label May 20, 2024
