Small difference causes suspicious performance decrease #8624
Comments
Locally, on an Intel(R) Core(TM) i9-14900K, I get 0.9s on `bad.wasm` and 0.74s on `good.wasm`. Profiling `good.wasm`, the hot loop is, according to
For
I believe WasmEdge is an LLVM-based backend, so it probably unrolls this loop or something similar.
I find that there is no difference between the binaries generated by So maybe the performance decrease in
My guess is you're hitting micro-architectural limits or things like that: basically, various performance cliffs in the CPU where, once you fall off the happy path, it's both difficult to explain why and difficult to understand the effects. That's just my best guess, though.
Test Cases
cases.zip
Steps to Reproduce
Hi, I ran the attached two cases (`good.wasm` and `bad.wasm`) in Wasmtime and WasmEdge (AOT), and collected their execution times respectively (measured by the `time` tool).

Expected Results & Actual Results
For `good.wasm`, the execution times in the different runtimes are as follows:

For `bad.wasm`, the execution times in the different runtimes are as follows:

The difference between the two attached cases is as follows: an operand of `i32.add` is changed from `i32.const 0` to `local.get 1`, which brings a 5x performance decrease on Wasmtime but has no effect on WasmEdge.

At first I thought the performance decrease was caused by this difference itself, because the `good` one uses a constant while the `bad` one uses a local variable, which may need to be fetched from memory. So I did a small experiment: repeatedly calculate (2000000000 times) an addition whose operand is either a constant or a local variable, and measure the execution times respectively. I found that they are almost the same, `1.09s` vs. `1.1s`. Therefore, I think the performance decrease above is caused by something else.

Profiling Information
I used the Perf tool to profile the execution and found that the hotspot is the loop where the small difference occurs, so I think the difference changes some compilation strategy, which may cause the performance decrease.
```
Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]
   0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
   0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object
   0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault
   0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr
```
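The constant-vs-local microbenchmark described above can be mimicked in Python. This is a rough analogy only: it times the CPython interpreter rather than JIT-compiled Wasm, `N` is scaled down from the original 2000000000 iterations, and the function names are illustrative:

```python
import timeit

N = 2_000_000  # scaled down from the 2_000_000_000 iterations in the issue

def add_constant(n=N):
    # analogue of `i32.const 0; i32.add` in a loop
    acc = 0
    for _ in range(n):
        acc = (acc + 0) & 0xFFFFFFFF  # wrap to 32 bits like i32.add
    return acc

def add_local(x, n=N):
    # analogue of `local.get 1; i32.add` in a loop
    acc = 0
    for _ in range(n):
        acc = (acc + x) & 0xFFFFFFFF
    return acc

t_const = timeit.timeit(add_constant, number=1)
t_local = timeit.timeit(lambda: add_local(1), number=1)
print(f"constant operand: {t_const:.3f}s, local operand: {t_local:.3f}s")
```

As in the issue's experiment, the two variants should time almost identically here, which supports the idea that the Wasmtime slowdown comes from a difference in generated code rather than from the operand fetch itself.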
Versions and Environment