
Explanation for performance gap #91

Open
ThomasDebrunner opened this issue Jun 15, 2020 · 4 comments

@ThomasDebrunner

I am curious about performance measurements and theoretical performance numbers. The often-stated theoretical performance of the VideoCore IV is 24 GFLOPS.

The author of py-videocore reaches 8.32 GFLOPS with hand-optimized code:
https://qiita.com/9_ties/items/e0fdd165c1c7df6bb8ee

The fastest claimed clpeak measurement using VC4CL is also just above 8 GFLOPS. On my Raspberry Pi, I measure about 6.3 GFLOPS.

So even a synthetic benchmark with hand-optimized code can only reach about one third of the theoretical performance. For desktop GPUs, clpeak mostly finds about the same performance as stated by the manufacturer. Where does this large performance gap come from?
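For context, the 24 GFLOPS figure can be rederived from the commonly cited VideoCore IV parameters (12 QPUs, 4 physical SIMD lanes per QPU, one add plus one multiply per cycle, 250 MHz clock; these numbers come from public VideoCore IV documentation, not from this thread):

```python
# Back-of-the-envelope check of the often-quoted 24 GFLOPS peak, and the
# fraction of it that the measurements above reach. Hardware figures are
# the commonly cited VideoCore IV values, not taken from this thread.

QPUS = 12          # quad processing units
LANES = 4          # physical SIMD lanes per QPU (16-wide vectors over 4 cycles)
OPS_PER_CYCLE = 2  # add ALU + multiply ALU can each retire one op per cycle
CLOCK_HZ = 250e6   # nominal QPU clock

peak_gflops = QPUS * LANES * OPS_PER_CYCLE * CLOCK_HZ / 1e9
print(f"theoretical peak: {peak_gflops:.0f} GFLOPS")  # 24 GFLOPS

for name, measured in [("py-videocore", 8.32), ("clpeak (reported)", 8.0),
                       ("clpeak (this Pi)", 6.3)]:
    print(f"{name}: {measured} GFLOPS = {measured / peak_gflops:.0%} of peak")
```

Note that the peak already assumes both ALUs retire a useful floating-point operation on every single cycle, which real code rarely achieves.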

@pfoof

pfoof commented Jun 15, 2020

I measured at most 13.62 GFLOP/s with a large number of loop iterations in FlopsCL and with float16. One important aspect is balancing kernel length against iteration count.

I have many measurements done, but will not publish them before October.

@doe300
Owner

doe300 commented Jun 15, 2020

So one big factor is the ALUs. You only get the full 24 GFLOPS if you utilize both ALUs in every clock cycle! Since the multiplication ALU does not support that many opcodes, it is definitely not utilized that much.
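In numbers: under the commonly cited VideoCore IV figures (12 QPUs, 4 lanes, 250 MHz; an assumption here, not stated in this thread), the attainable peak scales directly with how often the multiply ALU issues alongside the add ALU:

```python
# Attainable peak as a function of multiply-ALU utilization, assuming the
# add ALU is fully busy. Hardware figures are the commonly cited VideoCore
# IV values (12 QPUs, 4 physical lanes, 250 MHz), used here illustratively.
def attainable_gflops(mul_alu_utilization):
    qpus, lanes, clock_hz = 12, 4, 250e6
    add_ops = 1.0                  # add ALU issues one op every cycle
    mul_ops = mul_alu_utilization  # fraction of cycles the mul ALU issues
    return qpus * lanes * (add_ops + mul_ops) * clock_hz / 1e9

print(attainable_gflops(1.0))  # 24.0 -- both ALUs busy every cycle
print(attainable_gflops(0.0))  # 12.0 -- mul ALU never used: peak halves
```

So code that cannot pair an add with a multiply on most cycles has already lost up to half of the theoretical peak before any memory effects come into play.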

And of course the other problem will be the memory bandwidth. Compared to the fairly powerful compute units, the memory interfaces are very slow.

And as @pfoof hinted (I think), overly large kernel code (or branches skipping too many instructions) might also lead to instruction-cache misses. But I don't have any numbers for that.
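The bandwidth argument can be sketched with a simple roofline model: attainable throughput is the minimum of the compute peak and bandwidth times arithmetic intensity. The 24 GFLOPS peak comes from this thread; the bandwidth value below is an illustrative placeholder, not a measured figure for the Raspberry Pi:

```python
# Roofline sketch: attainable FLOPS = min(compute peak, bandwidth * intensity).
# The ~1.5 GB/s bandwidth is an ILLUSTRATIVE assumption, not a measurement.
def roofline_gflops(intensity_flops_per_byte, peak_gflops=24.0,
                    bandwidth_gbs=1.5):
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# A kernel doing one multiply-add (2 FLOPs) per float loaded (4 bytes)
# has intensity 0.5 FLOP/byte and is hopelessly bandwidth-bound:
print(roofline_gflops(0.5))   # 0.75 GFLOPS
# Under these assumptions, reaching peak needs >= 16 FLOPs per byte:
print(roofline_gflops(16.0))  # 24.0 GFLOPS
```

This is why synthetic FLOPS benchmarks chain many operations per loaded value: anything that touches memory frequently sits far below the compute roof.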

@pfoof

pfoof commented Nov 18, 2020

Hey @doe300, I couldn't find any other way to contact you, and I would like to share my master's thesis research on VC4CL:
https://www.researchgate.net/publication/346000679_Performance-energy_energy_benchmarking_of_selected_parallel_programming_platforms_with_OpenCL

@doe300
Owner

doe300 commented Nov 18, 2020

@pfoof, very interesting read, thanks for sharing!

I would have hoped the Raspberry Pi fares better on power per computation, but I guess I just have to try to improve the performance 😉

I definitely have to look at your thesis in more detail, especially the detailed benchmarks, result interpretations, and comparisons between Raspberry Pi CPU and GPU performance!
One thing I can already take away: the result of section 4.4 (Fibonacci adder) suggests that instruction-cache misses (or instruction fetching in general) have a far greater performance impact than I thought. Definitely something I should take a look at.
