Help interpreting profiling information?

Hey all,

I’m currently trying to profile one of the most time-consuming kernels in our project (~2.5ms per invocation)…

The kernel runs with a grid size of 1x100, and a block size of 22x22.

The visual profiler is telling me that “gst uncoalesced” (uncoalesced gmem stores) is 10676 for the kernel, with only 500 coalesced gmem stores, but each thread in my kernel only does a SINGLE gmem store (verified via decuda)… thus the entire kernel invocation should have exactly 48400 (1x100x22x22) gmem stores… I’m also rather sure they should all be coalesced - but I’m more worried about the fact that it only reports a total of 11176 gmem stores…

There are no branches stopping any thread from making that gmem store, no early returns, no nothing.

All of my other memory operations are texture/smem/gmem LOADS(/reads), and smem stores - in fact there are only 2 gmem accesses (1 load, 1 store) in the entire kernel.

Maybe I’m misinterpreting or I misunderstand something?

Hmm, it seems I forgot how memory transactions worked over the holiday period - I forgot that 32-bit accesses are grouped into 64-byte transactions (16 threads per transaction)…

In which case for my 1x100x22x22 threads I should be getting 3025 coalesced transactions - but for some reason I’m getting a lot of uncoalesced accesses. And still my 500 coalesced (= 8000 threads, considering 64byte transactions) + 10676 != 48400.

So I’m still misunderstanding something here…
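For anyone following along, the arithmetic above can be sketched in a few lines of host-side C++ (C++14 or later). This is only a sanity check of the numbers quoted in this thread, not anything the profiler itself does; the function names are made up for illustration, and it assumes 4-byte stores packed perfectly into 64-byte transactions.

```cpp
// Host-side arithmetic only; no CUDA toolkit needed.
// All function names here are hypothetical helpers for this thread.

// Total threads launched by a gridX x gridY grid of blockX x blockY blocks.
constexpr long total_threads(long gridX, long gridY, long blockX, long blockY) {
    return gridX * gridY * blockX * blockY;
}

// Best-case transaction count, assuming every thread stores bytes_per_thread
// bytes and the stores pack perfectly into txn_bytes-sized transactions.
constexpr long ideal_transactions(long threads, long bytes_per_thread, long txn_bytes) {
    return (threads * bytes_per_thread + txn_bytes - 1) / txn_bytes;
}

// Threads accounted for by a number of fully used transactions.
constexpr long threads_covered(long transactions, long bytes_per_thread, long txn_bytes) {
    return transactions * (txn_bytes / bytes_per_thread);
}

// Numbers from the posts above.
static_assert(total_threads(1, 100, 22, 22) == 48400, "one store per thread");
static_assert(ideal_transactions(48400, 4, 64) == 3025, "48400 threads / 16 per transaction");
static_assert(threads_covered(500, 4, 64) == 8000, "500 coalesced transactions");
static_assert(threads_covered(500, 4, 64) + 10676 == 18676, "still well short of 48400");
```

Note that the 3025 figure is a best case: each 484-thread block ends in a partial group of 4 threads (484 = 30x16 + 4), so the real per-block count would be slightly higher even with perfect alignment.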

Edit: I’m guessing memory transactions have to be aligned to work properly, I just stumbled across a vague reference (in one of the diagrams, it’s not explained elsewhere) saying that unaligned access isn’t coalesced (Figure 5-1).

I guess that because gmem is partitioned into 32/64/128-byte segments, it’s somewhat implied that you can’t do unaligned coalesced access - which would explain my problem (22x22x4 = 1936 bytes isn’t a multiple of 32, so I’ll have to pad my resulting data).
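The padding idea can be checked with a rough host-side C++ sketch (C++14 or later). It rests on assumptions not confirmed in the thread: that each block writes its 22x22 floats contiguously starting at block index times a per-block stride, that the output buffer itself starts on a segment boundary, and that 64 bytes is the segment size that matters. The helper names are hypothetical.

```cpp
// Host-side arithmetic only; no CUDA toolkit needed.
// Assumed layout: block b writes its results starting at element b * stride_elems.

// Round value up to the next multiple of `multiple`.
constexpr long round_up(long value, long multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// True if block b's first output byte falls on a segment boundary.
constexpr bool block_base_aligned(long b, long stride_elems,
                                  long elem_bytes, long segment_bytes) {
    return (b * stride_elems * elem_bytes) % segment_bytes == 0;
}

// How many of `blocks` blocks start on a segment boundary.
constexpr long aligned_block_count(long blocks, long stride_elems,
                                   long elem_bytes, long segment_bytes) {
    long count = 0;
    for (long b = 0; b < blocks; ++b) {
        if (block_base_aligned(b, stride_elems, elem_bytes, segment_bytes)) {
            ++count;
        }
    }
    return count;
}

// Unpadded: 484 floats = 1936 bytes per block, which is 30 * 64 + 16,
// so only every fourth block starts on a 64-byte boundary.
static_assert(aligned_block_count(100, 22 * 22, 4, 64) == 25, "unpadded stride");

// Padded: round the stride up to a multiple of 16 floats (64 bytes).
static_assert(round_up(22 * 22, 16) == 496, "padded stride in elements");
static_assert(aligned_block_count(100, 496, 4, 64) == 100, "padded stride");
```

Under those assumptions, padding each block's output region from 484 to 496 floats costs 12 unused floats per block and puts every block's first store on a 64-byte boundary.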

I think you forgot that the profiler profiles a single multiprocessor.

And if the profiler says they are uncoalesced, they almost certainly are uncoalesced, so you can probably get your longest kernel call down quite a bit.