Optimize Vectorized<float> exp() with neon simd instructions #126612

Open
wants to merge 2 commits into
base: main
from

@helloguo helloguo commented May 18, 2024

Optimize Vectorized<float> exp() with NEON SIMD instructions. The implementation is copied, with minor changes, from https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/v_expf.c.
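
For readers unfamiliar with how kernels of this kind are usually structured, here is a minimal scalar sketch of the general technique: reduce the argument with `x = n*ln2 + r`, approximate `exp(r)` with a polynomial, then scale by `2^n` through the float exponent bits. This is not the code in this PR or in `v_expf.c`. The name `exp_sketch`, the clamping bounds, and the plain Taylor coefficients are assumptions chosen for illustration; a tuned routine would normally use minimax coefficients and process several lanes at once with NEON intrinsics.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical illustration only: a scalar model of range reduction plus
// polynomial evaluation. It is NOT the implementation from this PR.
inline float exp_sketch(float x) {
  if (std::isnan(x)) return x;  // propagate NaN unchanged

  // Assumption: clamp so that 2^n stays a normal float. Overflow to inf and
  // gradual underflow are deliberately not modeled here.
  if (x > 88.0f) x = 88.0f;
  if (x < -87.0f) x = -87.0f;

  const float kInvLn2 = 1.44269504f;     // 1 / ln(2)
  const float kLn2Hi = 0.693359375f;     // high part of ln(2), few mantissa bits
  const float kLn2Lo = -2.12194440e-4f;  // low part, so kLn2Hi + kLn2Lo ~= ln(2)

  // n = round(x / ln2); r = x - n*ln2 computed in two steps (Cody-Waite)
  // so that the subtraction loses as little precision as possible.
  const float n = std::nearbyint(x * kInvLn2);
  float r = x - n * kLn2Hi;
  r -= n * kLn2Lo;

  // Degree-7 Taylor polynomial for exp(r) on |r| <= ln(2)/2, Horner form.
  float p = 1.0f / 5040.0f;
  p = p * r + 1.0f / 720.0f;
  p = p * r + 1.0f / 120.0f;
  p = p * r + 1.0f / 24.0f;
  p = p * r + 1.0f / 6.0f;
  p = p * r + 0.5f;
  p = p * r + 1.0f;
  p = p * r + 1.0f;

  // Build 2^n by writing the biased exponent directly into the float bits.
  const int32_t bits = (static_cast<int32_t>(n) + 127) << 23;
  float scale;
  std::memcpy(&scale, &bits, sizeof(scale));
  return p * scale;
}
```

Calling `exp_sketch(1.0f)` should agree with `std::exp(1.0)` to roughly single-precision accuracy. In a NEON version each scalar step above would map to a lane-wise operation, which is where the speedup would come from.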

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot bot commented May 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126612

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c28c26c with merge base 6bb9d60:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 18, 2024
@WenleiHe
Contributor

Any measurement comparing against sleef (not sure how it's implemented there for arm)?

@helloguo
Author

> Any measurement comparing against sleef (not sure how it's implemented there for arm)?

Tested on a MacBook Air M2 with the Llama-2-7b model using AOT Inductor: the default expf from libsystem_m.dylib gives 0.78 tokens/sec, Sleef gives 0.96 tokens/sec, and this PR gives 1.57 tokens/sec.

@desertfire desertfire added the ciflow/linux-aarch64 linux aarch64 CI workflow label May 20, 2024
@metascroy
Contributor

@helloguo do you have any testing on the accuracy of this method?

I ask because I think Sleef guarantees a certain accuracy with its polynomial approximations, so this might not be an apples-to-apples comparison.

I see that Sleef does offer the ability to lower the accuracy for more speed if desired, but I haven't played around with it (see section "ULP, gradual underflow and flush-to-zero mode" in https://sleef.org/additional.xhtml).
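
One way such an accuracy check could be set up is sketched below: sweep an input range, compare each float result against a double-precision reference, and report the worst error in ULPs. The helper names `ulp_error` and `max_ulp_error` are hypothetical, and using `std::exp` in double as the reference is an assumption; a rigorous test would use a higher-precision reference and cover every float input rather than a sample.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: error of `approx` against `exact`, measured in units
// of the float spacing (ULP) at the magnitude of the reference value.
inline double ulp_error(float approx, double exact) {
  const float ref = std::fabs(static_cast<float>(exact));
  const float next = std::nextafter(ref, std::numeric_limits<float>::infinity());
  const double ulp = static_cast<double>(next) - static_cast<double>(ref);
  return std::fabs(static_cast<double>(approx) - exact) / ulp;
}

// Hypothetical helper: sweep `steps` evenly spaced inputs in [lo, hi] and
// return the largest ULP error of `f` against std::exp evaluated in double.
template <typename F>
double max_ulp_error(F f, float lo, float hi, int steps) {
  assert(steps > 1);
  double worst = 0.0;
  for (int i = 0; i < steps; ++i) {
    const float x = lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(steps - 1);
    const double exact = std::exp(static_cast<double>(x));
    const double err = ulp_error(f(x), exact);
    if (err > worst) worst = err;
  }
  return worst;
}
```

For example, `max_ulp_error([](float x) { return std::exp(x); }, -10.0f, 10.0f, 100000)` would report the sampled worst case for the platform's own `expf`. Passing the NEON kernel and the Sleef routine through the same function would put the accuracy figures next to the throughput figures.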

4 participants