Interpreting profiler output

Trying to figure out if/when memory coalescing is working in my kernel. Found a bunch of posts asking for more explanation of the profiler output, but the responses didn’t seem to really answer the question. :-)

I don’t think I’m getting coalesced loads in my actual kernel, so I backed off to a simple test kernel. I’ve got 4 ints per “cell”, naturally stored on the host as an array of structs. The first test kernel accesses the data as an array of structs (AoS); the second kernel accesses the data as 4 separate arrays (SoA). As expected, the second kernel is way faster. The profiler output is below. I have a GTX 260, so I’m not requesting the gld_incoherent signal (it is zero, as documented).
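The two access patterns look roughly like this (a simplified sketch, not my exact templated kernels):

// Simplified sketch of the two access patterns (not the exact test kernels).
struct Cell { int a, b, c, d; };   // 16 bytes per "cell" -- AoS layout on the host

// Kernel 1: AoS access -- each thread loads a whole 16-byte Cell, so
// consecutive threads touch addresses 16 bytes apart.
__global__ void TestKernel1_AoS(const Cell* cells, int* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Cell c = cells[i];
        out[i] = c.a + c.b + c.c + c.d;
    }
}

// Kernel 2: SoA access -- consecutive threads read consecutive 4-byte ints
// from each array, which coalesces cleanly.
__global__ void TestKernel2_SoA(const int* a, const int* b, const int* c,
                                const int* d, int* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i] + c[i] + d[i];
}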

method=[ Z16TestCuda_Kernel1If6float3EvjjPKT_S3_PKjPS1 ] gputime=[ 2736.960 ] cputime=[ 2776.364 ]
occupancy=[ 1.000 ] gld_coherent=[ 69456 ] gld_32b=[ 0 ] gld_64b=[ 0 ] gld_128b=[ 69456 ]
method=[ Z16TestCuda_Kernel2If6float3EvjjPKT_S3_PKjS5_S5_S5_PS1 ] gputime=[ 398.496 ] cputime=[ 419.559 ]
occupancy=[ 1.000 ] gld_coherent=[ 34720 ] gld_32b=[ 0 ] gld_64b=[ 34720 ] gld_128b=[ 0 ]

My question is: Aside from the execution time, what, if anything, useful is the profiler telling me about my memory loads?

Notice that the first kernel, doing uncoalesced reads, is much slower as expected, but… according to the profiler it’s actually doing MORE coherent loads, and it’s doing 128-byte loads while the fast coalesced kernel is doing 64-byte loads. By that measure, shouldn’t the first kernel be faster? What’s up with that?

Thanks,
-Mike

First, the gld/gst profiler signals are counted for a single TPC (so, for 3 multiprocessors), not for the whole GPU.

What the above is telling you is that the first kernel led to 69,456 bus transactions of 128B each (a total of 128B * 69,456 = ~8.5 MB read across the bus by the 3 multiprocessors during the kernel’s execution). Similarly, the second kernel led to 34,720 64B transactions, for a total of ~2.1 MB read across the bus by the TPC. Assuming your application requested the same number of bytes in both cases, this indicates that the first kernel was suffering from uncoalescing. If you use the Visual Profiler, it will actually compute the memory throughputs for you (in GB/s). You can then readily compare that to the memory throughput your application observes (since the app counts only the “useful” bytes), which again gives you an idea of how badly uncoalesced your accesses are.
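To make the arithmetic concrete, here is a rough host-side sketch using the counts above (the “requested” figure is hypothetical -- it assumes the second kernel’s traffic is all useful bytes; substitute whatever your kernel actually reads):

#include <cstdio>

int main()
{
    // Per-TPC counts from the profiler output above.
    double kernel1_bytes = 69456.0 * 128.0;  // ~8.5 MB moved across the bus
    double kernel2_bytes = 34720.0 * 64.0;   // ~2.1 MB moved across the bus

    // Assume both kernels request the same number of useful bytes
    // (hypothetical: taken here to be kernel 2's fully coalesced traffic).
    double requested = kernel2_bytes;

    printf("kernel 1 moves %.1fx the requested bytes\n", kernel1_bytes / requested);  // ~4.0x
    printf("kernel 2 moves %.1fx the requested bytes\n", kernel2_bytes / requested);  // 1.0x
    return 0;
}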

Paulius

That’s helpful, thanks. To summarize:

  • Transferring larger chunks is not necessarily a good thing, and

  • Uncoalesced reads are probably happening if the GPU is transferring more bytes than necessary.

-Mike

The second one is definitely correct. The first statement by itself is not necessarily correct - for example, if you read 64-bit words, perfectly aligned, you will be transferring 128B segments across the bus (a half-warp of 16 threads reading 8 bytes each touches 128 contiguous bytes). But since all the bytes that move across the bus are used by the application, you’ll still get good performance. You can check Section 5.1.2.1 of the CUDA 2.3 Programming Guide for how the hardware figures out which memory segments to fetch when your code accesses memory.
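For example, something like this (just a sketch; int2 is used here only as a convenient 8-byte type):

// Each thread reads one aligned 8-byte element.  A half-warp of 16 threads
// covers 16 * 8 = 128 contiguous bytes, so the hardware issues a single 128B
// segment -- and every byte of it is used, so the larger transaction size
// costs nothing.
__global__ void Copy64BitWords(const int2* in, int2* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}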

Paulius