Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing an immediate operand from 1 to 0 makes suspicion performance decreasing #8648

Closed
hungryzzz opened this issue May 17, 2024 · 2 comments

Comments

@hungryzzz
Copy link
Contributor

hungryzzz commented May 17, 2024

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 3.07s
  • WasmEdge: 1.18s

For bad.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 4.68s
  • WasmEdge: 1.18s

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

➜  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusions to the machine code generatd by Wasmtime:

  1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
  2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?
➜  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82:	c7 02 01 00 00 00    	movl   $0x1,(%rdx)
---
>   82:	c7 02 00 00 00 00    	movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	41 bb 01 00 00 00    	mov    $0x1,%r11d
44:	c7 01 01 00 00 00    	movl   $0x1,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
56:	44 88 99 4c 01 00 00 	mov    %r11b,0x14c(%rcx)
5d:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

# bad.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
44:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	b8 01 00 00 00       	mov    $0x1,%eax
55:	88 81 4c 01 00 00    	mov    %al,0x14c(%rcx)
5b:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
       │      xor    %r10d,%r10d
  0.01 │      mov    %r10b,(%rcx)
  0.01 │      movl   $0x0,(%rcx)
 20.82 │      movl   $0x0,(%rcx)
  0.05 │      movl   $0x0,(%rcx)
 18.60 │      mov    $0x1,%r11d
  0.03 │      movl   $0x1,(%rcx)
 20.42 │      movl   $0x0,(%rcx)
 22.37 │      movl   $0x0,(%rcx)
 17.68 │      mov    %r11b,0x14c(%rcx)
  0.01 │      mov    %r10b,0x1bc(%rcx)

# bad.wasm
       │      xor    %r10d,%r10d
       │      mov    %r10b,(%rcx)
 24.94 │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
 75.04 │      movl   $0x0,(%rcx)
  0.01 │      mov    $0x1,%eax
       │      mov    %al,0x14c(%rcx)
  0.01 │      mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

  • Wasmtime version or commit: 91ec9a5
  • Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
  • Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
@cfallin
Copy link
Member

cfallin commented May 17, 2024

# good.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	41 bb 01 00 00 00    	mov    $0x1,%r11d
44:	c7 01 01 00 00 00    	movl   $0x1,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
56:	44 88 99 4c 01 00 00 	mov    %r11b,0x14c(%rcx)
5d:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

# bad.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
44:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	b8 01 00 00 00       	mov    $0x1,%eax
55:	88 81 4c 01 00 00    	mov    %al,0x14c(%rcx)
5b:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

@hungryzzz, would you mind isolating this into a microbenchmark written in assembly and measuring it by itself? (With appropriate values in rcx to make the stores valid and a loop to run a suitable number of iterations, etc...) It would be good to understand if this is really the difference that matters.

I am fascinated by this case because the assembly differs in only two ways: the mov $1, reg is shifted downward by three instructions, and the register allocator chose rax rather than r11. The latter shouldn't matter at all; and the former shouldn't matter on any out-of-order CPU. The alignment of both sequences is the same, too.

(As to why the instruction moves: in the good.wasm case the immediate is shared by two of the instructions [EDIT: or at least would have been if the 32-bit store of constant 1 didn't fold the immediate in], so we only load it once, before its first use. Separately we could probably have another lowering rule for mov [byte immediate], (mem) but that would change both cases here.)

@hungryzzz
Copy link
Contributor Author

hungryzzz commented May 29, 2024

@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot be replayed in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a newer generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants