Help interpreting profiling information?

Smokey · January 9, 2009, 2:02am

Hey all,

I’m currently trying to profile one of the most time-consuming kernels in our project (~2.5ms per invocation)…

The kernel runs with a grid size of 1x100, and a block size of 22x22.

The visual profiler is telling me that “gst uncoalesced” (uncoalesced gmem stores) is 10676 for the kernel, with only 500 coalesced gmem stores, but each thread in my kernel only does a SINGLE gmem store (verified via decuda)… thus the entire kernel invocation should have exactly 48400 (1*100*22*22) gmem stores… I’m also fairly sure they should all be coalesced - but I’m more worried about the fact that it only reports a total of 11176 gmem stores…

There are no branches stopping any thread from making that gmem store, no early returns, no nothing.

All of my other memory operations are texture/smem/gmem LOADS(/reads), and smem stores - in fact there are only 2 gmem accesses (1 load, 1 store) in the entire kernel.

Maybe I’m misinterpreting or I misunderstand something?

Smokey · January 9, 2009, 2:18am

Hmm, it seems I forgot how memory transactions worked over the holiday period - I forgot that 32-bit accesses are grouped into 64-byte transactions…

In which case for my 1x100x22x22 threads I should be getting 3025 coalesced transactions (48400 / 16) - but for some reason I’m getting a lot of uncoalesced accesses. And my 500 coalesced (= 8000 threads, at 16 threads per 64-byte transaction) + 10676 uncoalesced still != 48400.
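For reference, the arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the counting, not of how the profiler works: the half-warp size of 16 is an assumption that matches the 64-byte figure above for 32-bit stores, and the variable names are made up for illustration.

```python
# Launch configuration and counters quoted in the thread.
GRID = (1, 100)
BLOCK = (22, 22)
HALF_WARP = 16  # assumption: one 64-byte transaction serves 16 threads x 4 bytes

threads_per_block = BLOCK[0] * BLOCK[1]
total_threads = GRID[0] * GRID[1] * threads_per_block

# The post's estimate: every 16 threads share one coalesced transaction.
ideal_transactions = total_threads // HALF_WARP

# Warps do not span blocks, so each block rounds up to whole half-warps;
# the last half-warp of a 484-thread block is only partly filled.
half_warps_per_block = -(-threads_per_block // HALF_WARP)  # ceiling division
half_warps_launched = GRID[0] * GRID[1] * half_warps_per_block

gst_coalesced = 500
gst_uncoalesced = 10676
reported_total = gst_coalesced + gst_uncoalesced

print(total_threads)        # 48400
print(ideal_transactions)   # 3025
print(half_warps_launched)  # 3100
print(reported_total)       # 11176
```

Neither 3025 nor 3100 matches the reported 11176, so the counters are clearly not a plain count of per-thread stores over the whole grid.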

So I’m still misunderstanding something here…

Edit: I’m guessing memory transactions have to be aligned to work properly, I just stumbled across a vague reference (in one of the diagrams, it’s not explained elsewhere) saying that unaligned access isn’t coalesced (Figure 5-1).

I guess because gmem is partitioned into 32/64/128 byte segments though, it’s somewhat implied that you can’t do unaligned coalesced access - which would explain my problem (22*22*4 = 1936 bytes per block isn’t a multiple of 32, thus I’ll have to pad my resulting data).
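The alignment argument can be sketched the same way. The layout here is an assumption, not something stated in the thread: each block is taken to write its 22*22 32-bit results contiguously, starting at block index * stride from a buffer that is itself 64-byte aligned. The helper names are hypothetical.

```python
ELEM_BYTES = 4             # 32-bit store per thread
THREADS_PER_BLOCK = 22 * 22
SEGMENT_BYTES = 64         # assumed segment size for a half-warp of 32-bit stores
NUM_BLOCKS = 100

def block_base(block, stride_elems):
    """Byte offset of a block's first output element (assumed layout)."""
    return block * stride_elems * ELEM_BYTES

def aligned_blocks(stride_elems):
    """Number of blocks whose output starts on a segment boundary."""
    return sum(
        1 for b in range(NUM_BLOCKS)
        if block_base(b, stride_elems) % SEGMENT_BYTES == 0
    )

def padded_stride(elems):
    """Round an element count up to a whole number of segments."""
    per_segment = SEGMENT_BYTES // ELEM_BYTES
    return -(-elems // per_segment) * per_segment

bytes_per_block = THREADS_PER_BLOCK * ELEM_BYTES
print(bytes_per_block)                    # 1936
print(bytes_per_block % SEGMENT_BYTES)    # 16
print(aligned_blocks(THREADS_PER_BLOCK))  # 25
print(padded_stride(THREADS_PER_BLOCK))   # 496
print(aligned_blocks(padded_stride(THREADS_PER_BLOCK)))  # 100
```

Under these assumptions only every fourth block starts on a 64-byte boundary with the unpadded stride of 484 elements, while padding the per-block stride to 496 elements (1984 bytes) aligns all 100 blocks.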

E.D_Riedijk · January 9, 2009, 6:15am

I think you forgot that the profiler profiles a single multiprocessor.

And if the profiler says they are uncoalesced, they almost certainly are uncoalesced, so you can probably cut the time of your longest kernel call quite a bit.
