I’m trying to do some volume rendering on an NVIDIA Quadro K4100M device.
The goal is to create a 512x512 image where each pixel represents a single ray traversing the 3D space and sampling a 3D volumetric texture on each step along its way.
I’ve implemented this algorithm in a shader (HLSL, though that shouldn’t matter much) and more recently also in CUDA. The two implementations are almost exactly the same. For the shader, I draw a polygon that covers the entire 512x512 render target and implement the ray tracing as a ‘for’ loop in the pixel shader. For CUDA, I use 16x16 blocks and implement the same ‘for’ loop in the kernel, writing the result to a 512x512 target texture.
My shader performs at ~30fps, whereas my CUDA implementation is MUCH slower, running at about 1fps.
I thought maybe something was wrong with my implementation, or that 3D texture sampling is simply slower in CUDA, so I ran a much simpler test comparing the following:
A. Pixel shader which just does a ‘for’ loop (no 3D texture sampling) and outputs the result:
    float val = 0.0f;
    for (int i = 0; i < 8000; i++)
        val += i / 10000.0f;
B. CUDA kernel which does exactly the same (‘for’ loop above) and writes the output to a target texture.
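A minimal sketch of what test B could look like, under some assumptions: it writes to a plain linear float buffer rather than the target texture, and the names `loopKernel` and `runLoopTest` are hypothetical, not taken from the actual project. The 16x16 block size and the 512x512 image size match the setup described above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one thread per pixel, same 'for' loop as the shader test.
__global__ void loopKernel(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard against partial blocks

    float val = 0.0f;
    for (int i = 0; i < 8000; i++)
        val += i / 10000.0f;

    // Assumption: linear buffer output instead of a texture/surface write.
    out[y * width + x] = val;
}

// Hypothetical host helper: launches the kernel over a 512x512 image and
// returns the value of pixel (0,0), or -1.0f if any CUDA call fails.
float runLoopTest()
{
    const int width = 512, height = 512;
    const size_t bytes = size_t(width) * height * sizeof(float);

    float* dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return -1.0f;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    loopKernel<<<grid, block>>>(dOut, width, height);

    // Check for launch errors before reading anything back.
    if (cudaGetLastError() != cudaSuccess) { cudaFree(dOut); return -1.0f; }

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<float> hOut(size_t(width) * height);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return (err == cudaSuccess) ? hOut[0] : -1.0f;
}
```

Calling `runLoopTest()` from a small `main` in a `.cu` file compiled with `nvcc` should be enough to reproduce the test in isolation.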
Guess what? The shader still runs much faster than the CUDA kernel! I’ve also verified that if I decrease the number of iterations in the loop (from 8000 to 10), both run extremely fast. I also checked that the loop is not being optimized away by the compiler, by doing more complex computations inside it.
Why would my CUDA kernel be so slow with ‘for’ loops? If the GPU parallelizes the work across its cores, why should it matter whether I run a shader or a CUDA kernel? Why would the shader be so much faster than CUDA?
Is it possible that the Quadro K4100M has a crappy CUDA implementation? Maybe it’s a driver issue? Maybe I’m not setting up CUDA correctly, or not compiling my .cu file with the correct options?
The way I’m testing performance in CUDA is simply to run the kernel and then copy the pixels from device to host to get the result. I’ve noticed that nothing really seems to happen until I call cudaMemcpy2D(), as if launching the kernel only schedules the job and it is executed when I request the pixels.
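Kernel launches are asynchronous with respect to the host, which is consistent with the observation that nothing appears to happen until the copy. A hedged sketch of how the kernel time could be measured on its own, separately from the device-to-host copy, using CUDA events: the kernel is the same simple loop test, the output is again assumed to be a linear buffer rather than a texture, and `loopKernel` and `timeKernelMs` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: the simple 'for' loop test, one thread per pixel.
__global__ void loopKernel(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float val = 0.0f;
    for (int i = 0; i < 8000; i++)
        val += i / 10000.0f;
    out[y * width + x] = val;
}

// Hypothetical helper: returns the kernel-only time in milliseconds,
// or -1.0f if a CUDA call fails.
float timeKernelMs()
{
    const int width = 512, height = 512;
    float* dOut = nullptr;
    if (cudaMalloc(&dOut, size_t(width) * height * sizeof(float)) != cudaSuccess)
        return -1.0f;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // Warm-up launch so one-time initialization cost is not measured.
    loopKernel<<<grid, block>>>(dOut, width, height);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(dOut); return -1.0f; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded in the same (default) stream as the kernel,
    // so the elapsed time brackets the kernel execution only.
    cudaEventRecord(start);
    loopKernel<<<grid, block>>>(dOut, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block the host until the kernel is done

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

Timing this way would show whether the time is going into the kernel itself or into the copy and setup around it.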
I’d really prefer to work with CUDA, as it provides a more general environment for the problems I’m dealing with, but I need performance much closer to that of the shaders.
Any help would be highly appreciated!