Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEON: properly implement _high intrinsics #1030

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

easyaspi314
Copy link
Contributor

@easyaspi314 easyaspi314 commented May 31, 2023

High intrinsics merely have an implicit vget_high or vcombine as a helper for most of the widen or narrow instructions since 64-bit can't address the upper halves of registers anymore. There is no need to complicate them further.

Comment on lines -46 to -55
simde_int16x8_private r_;
simde_int16x8_private a_ = simde_int16x8_to_private(a);
simde_int8x16_private b_ = simde_int8x16_to_private(b);

SIMDE_VECTORIZE
for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
r_.values[i] = a_.values[i] + b_.values[i + ((sizeof(b_.values) / sizeof(b_.values[0])) / 2)];
}

return simde_int16x8_from_private(r_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. So you think that there is no architecture/compiler combo that would produce better code from this vectorize loop than the fallback of simde_vaddw_s8(a, simde_vget_high_s8(b)) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mostly going for ease of implementation on this PR.

If the compiler is reasonably intelligent it would be able to detect the redundant assignment/shuffle and eliminate it. However I haven't tested codegen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCC and Clang both generate identical code on a downscaled version, eliding the copy.

MSVC x86 emits a few extra instructions on /arch:IA32 either way if I use a copy loop or memcpy, but it isn't terrible. https://godbolt.org/z/Y3v4vjz46

Here is /arch:SSE2: https://godbolt.org/z/nWTKMfh7K

However, 99% of the time MSVC will use SSE2 by default — /arch:IA32 is opt-in.

GCC and Clang are the ones where scalar counts, and they emit identical code.

Long story short, 99% free code reuse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold up, the story changes with uint16_t... GCC vomits.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With which version does GCC vomit when compiling the uint16_t functions: the vectorized or the downscaled version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.

So you're seeing better code from this PR for GCC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Rather it is vaddw_u16 having mediocre codegen and reusing it passes those codegen issues to vaddw_high_u16. This is because GCC vectorizes it internally which is better for when SIMD is available

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Is this PR ready, or do you want to make other changes?

High intrinsics merely have an implicit vget_high or vcombine. There is no
need to complicate them further.
@mr-c
Copy link
Collaborator

mr-c commented Sep 29, 2023

@easyaspi314 hey-o, does this PR need more work or should I rebase and merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants