Changing an immediate operand from 1 to 0 makes suspicion performance decreasing #8648

hungryzzz · 2024-05-17T16:26:32Z

Test Cases

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

Wasmtime: 3.07s
WasmEdge: 1.18s

For bad.wasm, the execution time in different runtimes are as follows:

Wasmtime: 4.68s
WasmEdge: 1.18s

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

➜  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0

;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusions to the machine code generatd by Wasmtime:

Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?

➜  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82:	c7 02 01 00 00 00    	movl   $0x1,(%rdx)
---
>   82:	c7 02 00 00 00 00    	movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	41 bb 01 00 00 00    	mov    $0x1,%r11d
44:	c7 01 01 00 00 00    	movl   $0x1,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
56:	44 88 99 4c 01 00 00 	mov    %r11b,0x14c(%rcx)
5d:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

# bad.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
44:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	b8 01 00 00 00       	mov    $0x1,%eax
55:	88 81 4c 01 00 00    	mov    %al,0x14c(%rcx)
5b:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
       │      xor    %r10d,%r10d
  0.01 │      mov    %r10b,(%rcx)
  0.01 │      movl   $0x0,(%rcx)
 20.82 │      movl   $0x0,(%rcx)
  0.05 │      movl   $0x0,(%rcx)
 18.60 │      mov    $0x1,%r11d
  0.03 │      movl   $0x1,(%rcx)
 20.42 │      movl   $0x0,(%rcx)
 22.37 │      movl   $0x0,(%rcx)
 17.68 │      mov    %r11b,0x14c(%rcx)
  0.01 │      mov    %r10b,0x1bc(%rcx)

# bad.wasm
       │      xor    %r10d,%r10d
       │      mov    %r10b,(%rcx)
 24.94 │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
 75.04 │      movl   $0x0,(%rcx)
  0.01 │      mov    $0x1,%eax
       │      mov    %al,0x14c(%rcx)
  0.01 │      mov    %r10b,0x1bc(%rcx)

# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

Wasmtime version or commit: 91ec9a5
Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz

The text was updated successfully, but these errors were encountered:

cfallin · 2024-05-17T18:00:42Z

# good.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	41 bb 01 00 00 00    	mov    $0x1,%r11d
44:	c7 01 01 00 00 00    	movl   $0x1,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
56:	44 88 99 4c 01 00 00 	mov    %r11b,0x14c(%rcx)
5d:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

# bad.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
44:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	b8 01 00 00 00       	mov    $0x1,%eax
55:	88 81 4c 01 00 00    	mov    %al,0x14c(%rcx)
5b:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

@hungryzzz, would you mind isolating this into a microbenchmark written in assembly and measuring it by itself? (With appropriate values in rcx to make the stores valid and a loop to run a suitable number of iterations, etc...) It would be good to understand if this is really the difference that matters.

I am fascinated by this case because the assembly differs in only two ways: the mov $1, reg is shifted downward by three instructions, and the register allocator chose rax rather than r11. The latter shouldn't matter at all; and the former shouldn't matter on any out-of-order CPU. The alignment of both sequences is the same, too.

(As to why the instruction moves: in the good.wasm case the immediate is shared by two of the instructions [EDIT: or at least would have been if the 32-bit store of constant 1 didn't fold the immediate in], so we only load it once, before its first use. Separately we could probably have another lowering rule for mov [byte immediate], (mem) but that would change both cases here.)

hungryzzz · 2024-05-29T13:06:53Z

@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot be replayed in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a newer generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!

hungryzzz closed this as completed May 29, 2024

cfallin mentioned this issue May 29, 2024

Changing an immediate operand from 0 to 1 makes suspicion performance decreasing #8706

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changing an immediate operand from 1 to 0 makes suspicion performance decreasing #8648

Changing an immediate operand from 1 to 0 makes suspicion performance decreasing #8648

hungryzzz commented May 17, 2024 •

edited

cfallin commented May 17, 2024 •

edited

hungryzzz commented May 29, 2024 •

edited

Changing an immediate operand from 1 to 0 makes suspicion performance decreasing #8648

Changing an immediate operand from 1 to 0 makes suspicion performance decreasing #8648

Comments

hungryzzz commented May 17, 2024 • edited

Test Cases

Steps to Reproduce

Expected Results & Actual Results

Profiling Information

Versions and Environment

cfallin commented May 17, 2024 • edited

hungryzzz commented May 29, 2024 • edited

hungryzzz commented May 17, 2024 •

edited

cfallin commented May 17, 2024 •

edited

hungryzzz commented May 29, 2024 •

edited