CUDA much slower than Shader? (Solved: remove compiler -G flag)

ronb127 · October 22, 2014, 5:55am

Hello everybody,

I’m trying to do some volume rendering on an NVIDIA Quadro K4100M device.
The goal is to create a 512x512 image where each pixel represents a single ray traversing the 3D space and sampling a 3D volumetric texture on each step along its way.

I’ve implemented this algorithm in a shader (HLSL actually, but shouldn’t matter much) and lately also in CUDA. My implementations are almost exactly the same: for the shader I just draw a polygon which covers the entire 512x512 render target, then implement the ray tracing as a ‘for’ loop in the pixel shader. For CUDA I use 16x16 blocks and implement the same ‘for’ loop in the CUDA kernel, writing the result to a 512x512 target texture.

My shader performs at ~30fps, whereas my CUDA implementation is MUCH slower, running at about 1fps.

So I thought maybe something is wrong with my implementation, or maybe 3D texture sampling is just slower in CUDA. So I ran a much simpler test. I just compared the following:

A. Pixel shader which just does a ‘for’ loop (no 3D texture sampling) and outputs the result:
for (i=0; i<8000; i++)
    val += i / 10000.0f;

B. CUDA kernel which does exactly the same (‘for’ loop above) and writes the output to a target texture.
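
For reference, test B can be sketched roughly as below. This is a minimal sketch, not the original code: the kernel name loopTest, the linear float buffer (used in place of the target texture) and passing the iteration count as a kernel argument are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per pixel, each running the same 'for' loop.
__global__ void loopTest(float* out, int width, int height, int iterations)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard against partial blocks

    float val = 0.0f;
    for (int i = 0; i < iterations; i++)
        val += i / 10000.0f;

    out[y * width + x] = val;                // linear buffer stands in for the texture
}

int main()
{
    const int width = 512, height = 512, iterations = 8000;

    float* dOut = nullptr;
    cudaMalloc(&dOut, width * height * sizeof(float));

    // 16x16 threads per block, enough blocks to cover the 512x512 image.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    loopTest<<<grid, block>>>(dOut, width, height, iterations);

    // Copy one pixel back; this blocking copy also waits for the kernel to finish.
    float first = 0.0f;
    cudaMemcpy(&first, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    printf("pixel (0,0) = %f\n", first);

    cudaFree(dOut);
    return 0;
}
```

Because the iteration count is a runtime argument here, the compiler cannot fold the loop into a constant at compile time.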

Guess what? The shader still runs much faster than the CUDA kernel! I’ve also verified that if I decrease the number of iterations in the loop (from 8000 to 10) then both run extremely fast. I also verified that the compiler is not optimizing the loop away, by doing more complex computations inside the loop.

Why would my CUDA kernel be so slow with ‘for’ loops? If the GPU parallelizes the work across its multiple cores, why should it matter whether I run a shader or a CUDA kernel? Why would the shader be so much faster than CUDA?

Is it possible that the Quadro K4100M has a crappy CUDA implementation? Maybe it’s a driver issue? Maybe I’m not setting up CUDA correctly? Or am I not compiling my .cu file with the correct options?

The way I’m testing performance in CUDA is just by running the kernel, then copying the pixels from device to host to get the result (I’ve noticed that nothing really happens until I do the cudaMemcpy2D(), as if calling the kernel just schedules the job, but it is executed only when I request the pixels).
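
The behaviour described here is consistent with kernel launches being asynchronous: the launch call returns to the host right away, and a blocking call such as cudaMemcpy2D() then waits for the kernel to finish. One way to time only the kernel is with CUDA events. The sketch below assumes a hypothetical 1D kernel named busyKernel and is not the original test code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void busyKernel(float* out, int n, int iterations)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float val = 0.0f;
    for (int i = 0; i < iterations; i++)
        val += i / 10000.0f;
    out[idx] = val;
}

int main()
{
    const int n = 512 * 512, iterations = 8000;
    float* dOut = nullptr;
    cudaMalloc(&dOut, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded in the same stream as the kernel, so the
    // elapsed time covers GPU execution and excludes the copy back.
    cudaEventRecord(start);
    busyKernel<<<(n + 255) / 256, 256>>>(dOut, n, iterations);
    cudaEventRecord(stop);

    // The launch has already returned; block here until the GPU reaches 'stop'.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    else
        printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return 0;
}
```

Timing this way keeps the device-to-host copy out of the measurement, which makes the comparison with the shader frame time a little fairer.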

I’d really prefer to work with CUDA, as it provides a more general environment for the problems I’m dealing with, but I must get performance more similar to that of shaders.

Any help would be highly appreciated!

Thanks.

ronb127 · October 22, 2014, 8:04am

Problem solved.
My .cu file was compiled with the -G flag in Visual Studio 2013 by default.
(CUDA C/C++ > Device > Generate GPU Debug Information)

After removing it (compiling without debug information) I get outstanding performance!

Thanks.

njuffa · October 22, 2014, 3:43pm

That makes sense. Generally, -G causes most (if not all) optimizations to be disabled, so machine code can be mapped back to relevant source statements, variables remain trackable in the debugger, etc.
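
To catch this kind of accidental debug build before benchmarking, a program can report how it was compiled. The sketch below assumes the toolkit in use defines the __CUDACC_DEBUG__ macro, which nvcc documents as being set for device-debug (-G) builds; older toolkits may not define it.

```cuda
#include <cstdio>

int main()
{
    // Assumption: __CUDACC_DEBUG__ is defined by nvcc only when -G is passed.
#ifdef __CUDACC_DEBUG__
    printf("Built with -G: device optimizations are disabled, timings are not representative.\n");
#else
    printf("Built without -G.\n");
#endif
    return 0;
}
```

If source-line correlation is still wanted while profiling, nvcc's -lineinfo option is worth trying instead, since it adds line-number information for device code without disabling optimizations.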
