
Explanation for performance gap #91

Open
ThomasDebrunner opened this issue Jun 15, 2020 · 4 comments

@ThomasDebrunner

I am curious about performance measurements and theoretical performance numbers. The often-stated theoretical performance of the VideoCore IV is 24 GFLOPS.

The author of py-videocore reaches 8.32 GFLOPS with hand-optimized code:
https://qiita.com/9_ties/items/e0fdd165c1c7df6bb8ee

The fastest claimed clpeak measurement using VC4CL is also just above 8 GFLOPS. On my Raspberry Pi, I measure about 6.3 GFLOPS.

So even a synthetic benchmark with hand-optimized code can only reach about one third of the theoretical performance. For desktop GPUs, clpeak mostly finds about the same performance as stated by the manufacturer. Where does this large performance gap come from?
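For context, the 24 GFLOPS figure can be rederived from the commonly cited VideoCore IV parameters (12 QPUs, 4 physical SIMD lanes per QPU, one add plus one multiply per cycle, 250 MHz clock; these numbers come from public VideoCore IV documentation, not from this thread):

```python
# Back-of-the-envelope check of the often-quoted 24 GFLOPS peak, and the
# fraction of it that the measurements above reach. Hardware figures are
# the commonly cited VideoCore IV values, not taken from this thread.

QPUS = 12          # quad processing units
LANES = 4          # physical SIMD lanes per QPU (16-wide vectors over 4 cycles)
OPS_PER_CYCLE = 2  # add ALU + multiply ALU can each retire one op per cycle
CLOCK_HZ = 250e6   # nominal QPU clock

peak_gflops = QPUS * LANES * OPS_PER_CYCLE * CLOCK_HZ / 1e9
print(f"theoretical peak: {peak_gflops:.0f} GFLOPS")  # 24 GFLOPS

for name, measured in [("py-videocore", 8.32), ("clpeak (reported)", 8.0),
                       ("clpeak (this Pi)", 6.3)]:
    print(f"{name}: {measured} GFLOPS = {measured / peak_gflops:.0%} of peak")
```

Note that the peak already assumes both ALUs retire a useful floating-point operation on every single cycle, which real code rarely achieves.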

@pfoof

pfoof commented Jun 15, 2020

I measured at most 13.62 GFLOP/s with a large number of loop iterations in FlopsCL and with float16. One important aspect is balancing kernel length against iteration count.

I have many measurements done, but will not publish them before October.

@doe300
Owner

doe300 commented Jun 15, 2020

So one big factor is the ALUs. You only get the full 24 GFLOPS if you utilize both ALUs in every clock cycle! Since the multiplication ALU does not support that many opcodes, it is definitely not utilized that much.
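In numbers: under the commonly cited VideoCore IV figures (12 QPUs, 4 lanes, 250 MHz; an assumption here, not stated in this thread), the attainable peak scales directly with how often the multiply ALU issues alongside the add ALU:

```python
# Attainable peak as a function of multiply-ALU utilization, assuming the
# add ALU is fully busy. Hardware figures are the commonly cited VideoCore
# IV values (12 QPUs, 4 physical lanes, 250 MHz), used here illustratively.
def attainable_gflops(mul_alu_utilization):
    qpus, lanes, clock_hz = 12, 4, 250e6
    add_ops = 1.0                  # add ALU issues one op every cycle
    mul_ops = mul_alu_utilization  # fraction of cycles the mul ALU issues
    return qpus * lanes * (add_ops + mul_ops) * clock_hz / 1e9

print(attainable_gflops(1.0))  # 24.0 -- both ALUs busy every cycle
print(attainable_gflops(0.0))  # 12.0 -- mul ALU never used: peak halves
```

So code that cannot pair an add with a multiply on most cycles has already lost up to half of the theoretical peak before any memory effects come into play.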

And of course the other problem will be the memory bandwidth. Compared to the fairly powerful compute units, the memory interfaces are very slow.

And as @pfoof hinted (I think), overly large kernel code (or branches skipping too many instructions) might also lead to instruction-cache misses. But I don't have any numbers for that.
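The bandwidth argument can be sketched with a simple roofline model: attainable throughput is the minimum of the compute peak and bandwidth times arithmetic intensity. The 24 GFLOPS peak comes from this thread; the bandwidth value below is an illustrative placeholder, not a measured figure for the Raspberry Pi:

```python
# Roofline sketch: attainable FLOPS = min(compute peak, bandwidth * intensity).
# The ~1.5 GB/s bandwidth is an ILLUSTRATIVE assumption, not a measurement.
def roofline_gflops(intensity_flops_per_byte, peak_gflops=24.0,
                    bandwidth_gbs=1.5):
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# A kernel doing one multiply-add (2 FLOPs) per float loaded (4 bytes)
# has intensity 0.5 FLOP/byte and is hopelessly bandwidth-bound:
print(roofline_gflops(0.5))   # 0.75 GFLOPS
# Under these assumptions, reaching peak needs >= 16 FLOPs per byte:
print(roofline_gflops(16.0))  # 24.0 GFLOPS
```

This is why synthetic FLOPS benchmarks chain many operations per loaded value: anything that touches memory frequently sits far below the compute roof.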

@pfoof

pfoof commented Nov 18, 2020

Hey @doe300, I couldn't find any other way to contact you, and I would like to share my master's thesis research on VC4CL:
https://www.researchgate.net/publication/346000679_Performance-energy_energy_benchmarking_of_selected_parallel_programming_platforms_with_OpenCL

@doe300
Owner

doe300 commented Nov 18, 2020

@pfoof, very interesting read, thanks for sharing!

I would have hoped the Raspberry Pi fares better on power per computation, but I guess I just have to try to improve the performance 😉

I definitely have to look at your thesis in more detail, especially the detailed benchmarks, result interpretations, and comparisons between Raspberry Pi CPU and GPU performance!
One thing I can already take away: the result of section 4.4 (Fibonacci adder) suggests that instruction-cache misses (or instruction fetching in general) have a far greater performance impact than I thought. Definitely something I should take a look at.
