Disclaimer
I really think this is an important issue for all of us who have a graphics background and are looking to CUDA to widen our problem-solving capabilities. So, sorry for the long post.
Please, I’d love to hear comments from the CUDA experts around here regarding my experiences, and maybe some good advice from you guys and Nvidia as well!
Nevertheless, I hope I can help all the beginners who have had their own share of problems.
Thank you all!
So I’ll go ahead and ask the dreaded question: is it possible to achieve better performance with CUDA than with GLSL?
My answer is “yes”, Nvidia’s is “not so easy!”. Let me explain, so sit back and enjoy the ride.
So I have a nice, happy problem to solve that can be mapped (quite naively, I admit) onto the “old” GPGPU paradigm of “render a fullscreen quad and trigger your fragment shaders”.
Doing it that way, I get X frames per second.
Me-thinks: this could go really fast if I use CUDA! So, a little experiment: the same hardware, the same code, what could go wrong? I port the exact same code and make it work in CUDA.
A few days later, after struggling with the stubborn compiler over why using local memory is a Bad Idea (TM), and why inlined device functions shouldn’t increase register usage by a million, everything finally looks OK.
The result: I get between 50% and 90% of the GLSL code’s performance. And I’d like to note that just launching an empty kernel already puts CUDA way behind GLSL. What an overhead! So I’m only looking at reasonable data sizes here.
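(For anyone who wants to reproduce the overhead point: here is a minimal sketch of how one could time an empty launch with CUDA events and compare it against one GLSL fullscreen pass. The kernel and function names are just placeholders, not my actual code.)

    __global__ void emptyKernel() {}

    // Times a single do-nothing launch, in milliseconds.
    float timeEmptyLaunch(int blocks, int threads)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        emptyKernel<<<blocks, threads>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   // wait until the launch has actually finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;                    // cost of an empty launch
    }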
The main reason: register pressure. I’m bottlenecked at 40 registers, which equates to 25% occupancy. Here’s why I think so:
. Memory reads:
Mainly random, so I’m using textures (more than double the performance compared to global memory). I found 1D textures to be faster than 2D, because there is no addressing math. But I really wished for 3D textures :) (see the sketch after this list)
. Memory writes:
All coalesced at the end of kernel execution. Can it get any better than this?
. Constant memory:
I’ve found no use for it yet. It’s too small to fit my input data, unless I restrict the working set. I tried placing time-invariant kernel parameters there (also shown in the sketch below), but it made no difference.
. Shared memory:
Currently not used, so only about 140 bytes go there from kernel parameters. I tried using it to save registers, but never managed any gain. Maybe someone has advice in this department? ;)
. Divergent branches:
Only about 10% of total branch count. And here I was thinking this would be my doom…
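To make the memory side concrete, here is a minimal sketch of the access pattern I described above (hypothetical kernel and names, not my real code): scattered reads served by a 1D texture, a time-invariant parameter in constant memory, and one coalesced write per thread at the very end.

    // Bound on the host with cudaBindTexture() to the input array.
    texture<float, 1, cudaReadModeElementType> inputTex;

    // Time-invariant parameter, uploaded once with cudaMemcpyToSymbol().
    __constant__ float scaleParam;

    // Stand-in for whatever dependent/random lookup the algorithm really does.
    __device__ int scatterIndex(int i, int n)
    {
        return (i * 7919) % n;
    }

    __global__ void processKernel(float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Random-access reads go through the texture cache instead of global memory.
        float a = tex1Dfetch(inputTex, scatterIndex(i, n));
        float b = tex1Dfetch(inputTex, scatterIndex(i + 1, n));

        // Placeholder for the per-element math.
        float result = scaleParam * (a + b);

        // Consecutive threads write consecutive addresses, so this is coalesced.
        out[i] = result;
    }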
By all means, I’m no hardware telepath and have no absolute conviction about any of this; it only seems to me that occupancy is the bottleneck. Another argument: when I decreased the register count from 43 to 40, occupancy went up by 8% and the final performance by 15%.
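For anyone wanting to check the same numbers on their own kernels: ptxas can report the per-kernel register count, and a cap can be forced at compile time (the file name below is just an example; note that forcing the cap can spill into local memory, which is exactly the Bad Idea from before):

    nvcc -cubin --ptxas-options=-v kernel.cu     # prints registers / lmem / smem used per kernel
    nvcc -cubin --maxrregcount=40 kernel.cu      # caps register usage at 40, possibly spilling to lmem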
So the question becomes: where (and most importantly how) is GLSL performing its magic?
The code is 99% the same. The memory layout is similar:
GLSL ----------------------- CUDA
varying/uniforms ----------- kernel parameters (shared memory)
input textures 3D/1D ------- input textures 1D
output to framebuffer ------ coalesced global memory writes
Afterwards I have a device-to-device memcpy to a PBO, but that is really fast.
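(For completeness, that PBO step looks roughly like this; a sketch using the CUDA/OpenGL interop calls, with placeholder names, assuming GL headers are included and the PBO was registered once with cudaGLRegisterBufferObject().)

    #include <cuda_gl_interop.h>

    // d_result is the kernel's output buffer in device memory; pbo is an OpenGL
    // buffer object created with glGenBuffers()/glBufferData() and registered with CUDA.
    void copyResultToPBO(GLuint pbo, const float* d_result, size_t bytes)
    {
        void* d_pbo = 0;
        cudaGLMapBufferObject(&d_pbo, pbo);                             // map the PBO into CUDA's address space
        cudaMemcpy(d_pbo, d_result, bytes, cudaMemcpyDeviceToDevice);   // fast on-device copy
        cudaGLUnmapBufferObject(pbo);                                   // hand the buffer back to OpenGL
    }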
The logical conclusion: the compiler?!
So let me rephrase my fundamental question: is it possible to emulate similar GLSL behavior in hardware using CUDA?
As in: is it possible in principle, but right now the CUDA compiler is behind the GLSL one? (I know some hardware functionality isn’t exposed, like the rasterizer, filtering, ROPs, etc.)
Or is it a lost cause and I should have known better? Must I always go back to the drawing board and formulate a new algorithm suited to CUDA to solve the same problem? Or is “thinking in graphics terms” ever a valid way to go?
I suspect the answer is: it depends on the problem! In other words: just try and see what you get every time… Am I right?
Finally, to go back to my initial proposition: I do believe it is possible to achieve better performance with CUDA than with GLSL. There are two choices:
- You can try the same algorithms once the CUDA compiler becomes as smart as the GLSL one. I.e., me reordering code lines shouldn’t reduce the register count by 3!
- You can work hard to come up with a new way of solving your problem that better matches the CUDA paradigm.
Anyway, all this does not mean I’m giving up on CUDA. I know it has its uses. I just hope to find a way to make it work for me. Until then, back to the drawing board!
Thanks for listening!