Understanding cudaProf's output

qwertyasdf · August 13, 2009, 11:10am

I have just finished porting a heavy algorithm from D3D9 HLSL. The D3D9 version was very inefficient for many reasons. The CUDA version of the algorithm uses half the arithmetic instructions (exactly half, not a ballpark estimate) and probably one fifth of the global memory bandwidth: I replaced 28 passes to/from render targets/textures with 4 kernel launches, and the rest of the passes take place in shared memory.

Yet, the CUDA version runs less than half as fast as the D3D9 version.

So, I ran cudaProf to try and understand what is happening:

1. The summary table’s ‘glob mem overall throughput (GB/s)’ column sums to 327 GB/s, but my card is only capable of 141 GB/s, less than half as much. It also looks like the texture memory reads are not counted in the global memory reads, which means my real global memory access is even higher. What does this mean?
2. The shared memory indicated for my kernel is 4124 bytes, not the 4096 bytes I allocate. This is hurting my occupancy. What else is allocating shared memory?
3. A kernel that I know to be very short claims to execute 169,000 instructions, but the problematic kernel in this case, which is very long, executes 200,000 instructions, only slightly more. Why?
4. Is the optimal gld/gst efficiency 1? Its name suggests so, but a kernel that I am quite confident should be completely coalesced/aligned is getting only 0.04 gld/gst efficiency. Is reading/writing a float4 not coalesced? It is the size of a transaction (128 bits), correct? Other kernels that I expected to be coalesced do not appear to be either. One of my kernels reads/writes 4-byte words, in order, in 16x16 thread blocks, into memory allocated with cudaMallocPitch. This kernel is reported to have only 0.16 gld/gst efficiency. What else could I be doing wrong to not get coalesced transactions?
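For the coalescing question above, one way to sanity-check an access pattern by hand is to count how many 128-byte segments each half-warp touches. The sketch below is a simplified host-side model, not the profiler's own formula: it assumes a compute capability 1.2/1.3 device where a half-warp of 16 threads accessing 4-, 8- or 16-byte words gets one transaction per distinct 128-byte segment, and it ignores the hardware's step that shrinks a transaction to 64 or 32 bytes. The names `countSegments` and `rowAddresses` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed model: one transaction per distinct 128-byte segment touched by a
// half-warp (compute capability 1.2/1.3, words of 4, 8 or 16 bytes).
const std::size_t kHalfWarp = 16;
const std::uint64_t kSegmentBytes = 128;

// Counts the distinct 128-byte segments touched by one half-warp.
// `addresses` holds the byte address accessed by each thread,
// `wordBytes` is the size of each access.
std::size_t countSegments(const std::vector<std::uint64_t>& addresses,
                          std::size_t wordBytes) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addresses) {
        segments.insert(a / kSegmentBytes);                    // first byte
        segments.insert((a + wordBytes - 1) / kSegmentBytes);  // last byte
    }
    return segments.size();
}

// Byte addresses for one row of a 16x16 block (one half-warp) that reads
// consecutive words from a pitched allocation. `base` and `pitchBytes`
// are hypothetical values standing in for what cudaMallocPitch returns.
std::vector<std::uint64_t> rowAddresses(std::uint64_t base,
                                        std::uint64_t pitchBytes,
                                        std::size_t row,
                                        std::size_t wordBytes) {
    std::vector<std::uint64_t> out;
    for (std::size_t x = 0; x < kHalfWarp; ++x)
        out.push_back(base + row * pitchBytes + x * wordBytes);
    return out;
}
```

Under this model, a row of 4-byte words starting on an aligned pitch falls in a single segment, the same row starting at byte offset 100 straddles two segments, and a half-warp of float4 reads covers 256 bytes and so needs two. If a pattern that should touch one segment still reports low efficiency, the starting offset of each row is the first thing to check.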

That is a good start ;)

I am on a compute capability 1.3 card (GTX 280).
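On the shared memory question: on compute capability 1.x devices, kernel arguments and the built-in variables are passed through shared memory, which is the likely source of the extra 28 bytes on top of the 4096 allocated. The host-side sketch below estimates how that affects the number of resident blocks per multiprocessor. It considers the shared memory limit only (registers and thread counts can limit occupancy further), and the 512-byte allocation granularity is an assumption taken from the occupancy calculator's figures for 1.x parts. The function names are made up for illustration.

```cpp
#include <cstddef>

const std::size_t kSharedPerSM = 16 * 1024;  // 16 KB per multiprocessor on 1.x
const std::size_t kAllocUnit = 512;          // assumed allocation granularity
const std::size_t kMaxBlocksPerSM = 8;       // block limit per multiprocessor on 1.x

// Rounds `bytes` up to the next multiple of `unit`.
inline std::size_t roundUp(std::size_t bytes, std::size_t unit) {
    return (bytes + unit - 1) / unit * unit;
}

// Resident blocks per multiprocessor as limited by shared memory alone.
// `sharedPerBlock` is the total the profiler reports, including the
// bytes used for kernel arguments and built-in variables.
inline std::size_t blocksPerSM(std::size_t sharedPerBlock) {
    if (sharedPerBlock == 0) return kMaxBlocksPerSM;
    std::size_t fit = kSharedPerSM / roundUp(sharedPerBlock, kAllocUnit);
    return fit < kMaxBlocksPerSM ? fit : kMaxBlocksPerSM;
}
```

With these assumptions, 4096 bytes per block would allow 4 blocks, while 4124 bytes rounds up to 4608 and allows only 3. Trimming the user allocation to 4068 bytes, so that the total including the 28-byte overhead is 4096 again, would restore the fourth block.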

Razgriz · July 20, 2010, 4:58pm

I’m not sure if this ever got answered, but I am having similar issues with trying to determine what the output of cudaprof represents. In particular the instruction counts do not seem to correspond to the kernels that are being run.

ONeill · July 21, 2010, 10:18am

The profiler doesn't measure the values for all multiprocessors (MPs). It measures the counters on one or a few MPs (I'm not sure exactly how many), then prints results for all MPs extrapolated from that sample. It's most useful for checking whether an optimization technique reduces, for example, shared memory bank conflicts. Treat the numbers as relative results across different runs of your slightly changed app.
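Following that advice, a counter is best read as a ratio between two runs rather than as an absolute figure. A minimal sketch of that comparison, with a made-up helper name and no real profiler data:

```cpp
#include <cstdint>

// Relative change of one profiler counter between a baseline run and a
// modified run. Negative means the counter went down. Returns 0 when the
// baseline is 0, since no ratio can be formed in that case.
double relativeChange(std::uint64_t baseline, std::uint64_t modified) {
    if (baseline == 0) return 0.0;
    return (static_cast<double>(modified) - static_cast<double>(baseline)) /
           static_cast<double>(baseline);
}
```

For example, a bank-conflict counter that drops from a hypothetical 200 to 150 after a change gives -0.25, a 25% reduction, and that ratio stays meaningful even when the raw counts come from a sample of the multiprocessors.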
