How can I identify where coalescing can be done?

Now that I have the profiler working after adding -noprompt as an argument, I can see what’s going on. The profiler is giving me some results regarding coalesced and uncoalesced loads.

For example, for kernel2 the totals are:
gld uncoalesced = 60762750
gld coalesced = 93318

This does not look good.

How can I identify where coalescing can be done? Is there a general rule or set of rules I can apply?

How will using texture and/or shared memory affect this?

From what I read in the programming guide, it has something to do with how your kernel accesses shared memory… In order to coalesce, the threads need to read/write memory in a consecutive fashion (i.e. thread 1 writes to memory location 0x1000, thread 2 writes to memory location 0x1004, thread 3 writes to memory location 0x1008, and so on…).

Another thing to check for is whether the kernel is trying to write to the same memory location from different threads. This causes it to “skip”, since it needs to wait for thread 1 to write, then for thread 2 to write.

Coalescing is about global memory, not shared.

On shared memory you’ve gotta watch out for bank conflicts instead.
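
To make the bank-conflict point concrete, here is a rough Python model (host-side arithmetic only, not CUDA code and not taken from the Programming Guide). It assumes 16 banks of 32-bit words and a 16-thread half-warp, which is my understanding of compute capability 1.x hardware. conflict_degree and strided are made-up helper names, and treating every repeated read of the same word as a free broadcast is a simplification.

```python
from collections import defaultdict

NUM_BANKS = 16   # assumption: compute capability 1.x shared memory
HALF_WARP = 16   # assumption: conflicts are resolved per half-warp


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Return how many serialized steps a half-warp access would need.

    word_indices[k] is the 32-bit word index in shared memory touched by
    thread k. Successive words are assumed to live in successive banks.
    Threads touching the same word are counted once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % num_banks].add(w)
    return max(len(words) for words in words_per_bank.values())


def strided(stride, n=HALF_WARP):
    """Word indices for a hypothetical access like shared[tid * stride]."""
    return [t * stride for t in range(n)]


for stride in (1, 2, 3, 16):
    print("stride", stride, "-> degree", conflict_degree(strided(stride)))
```

Under this model a degree of 1 means no conflict. Strides of 1 and 3 give degree 1, a stride of 2 gives 2, and a stride of 16 puts every thread on the same bank for a degree of 16.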

Thanks for the correction…I’m still new to this ;)

Look at the Programming Guide section 5.1.2 for the coalescing description. Supercomputing 07 slides on perf (available on the CUDA U site off of CudaZone) give examples as well as performance for G80 (compute capability 1.1 and lower). GT200 and subsequent GPUs help with coalescing, especially in cases of misaligned or permuted access by threads.
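
As a sketch of what that section describes for G80-class parts, the check below models my reading of the compute capability 1.0/1.1 rule: within a half-warp of 16 threads, thread k must access the k-th word of a segment aligned to 16 times the word size, and inactive threads are allowed. This is plain Python for checking address patterns on paper, not CUDA code. is_coalesced_cc10 is a hypothetical name, and the relaxed GT200 behaviour is not modelled.

```python
HALF_WARP = 16  # assumption: coalescing is decided per half-warp


def is_coalesced_cc10(addresses, word_size=4):
    """Check one half-warp against the (assumed) CC 1.0/1.1 rule.

    addresses[k] is the byte address accessed by thread k, or None if
    thread k is inactive (for example, masked off by a divergent branch).
    """
    if word_size not in (4, 8, 16):
        return False                      # only 32-, 64-, 128-bit words
    segment = HALF_WARP * word_size       # required alignment in bytes
    active = [(k, a) for k, a in enumerate(addresses) if a is not None]
    if not active:
        return True                       # nothing accessed, nothing to break
    k0, a0 = active[0]
    base = a0 - k0 * word_size            # where thread 0 would have to point
    if base % segment != 0:
        return False                      # misaligned start
    # Every active thread k must hit exactly the k-th word after base.
    return all(a == base + k * word_size for k, a in active)


contiguous = [0x1000 + 4 * k for k in range(HALF_WARP)]
misaligned = [0x1004 + 4 * k for k in range(HALF_WARP)]
strided = [0x1000 + 8 * k for k in range(HALF_WARP)]
permuted = list(reversed(contiguous))
divergent = [a if k % 2 == 0 else None for k, a in enumerate(contiguous)]

for name, pattern in [("contiguous", contiguous), ("misaligned", misaligned),
                      ("strided", strided), ("permuted", permuted),
                      ("divergent", divergent)]:
    print(name, is_coalesced_cc10(pattern))
```

In this model only the contiguous and divergent patterns pass. The misaligned and permuted cases are the ones the post above says newer GPUs handle better.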


Ugh, that’s sort of a tough question. The main method is “sheer force of intellect.” Learn what coalescing is, read your source code, and predict where you need it. The second method is to comment out parts of your code to zero in on the problem, but that’s seriously hampered by the compiler optimizing other code out because it sees you’re no longer using its results. It’s still a useful method, however.

It’d be great if the profiler could produce line-by-line results.

P.S. the uncoalesced figure counts each uncoalesced access in your code as 512 (16x32). The coalesced figure counts each access as 32. Divide by those numbers respectively to get source code-level figures. E.g., your code performs about 120k loads, correct?
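
The division can be written out for the kernel2 numbers from the first post. This is just the arithmetic in Python. The weights 512 and 32 are the empirical factors suggested in this post, not documented profiler behaviour, and source_level_loads is a made-up name.

```python
GLD_UNCOALESCED = 60762750   # profiler total for kernel2 (first post)
GLD_COALESCED = 93318        # profiler total for kernel2 (first post)

UNCOALESCED_WEIGHT = 512     # 16 x 32, as suggested in this post
COALESCED_WEIGHT = 32        # as suggested in this post


def source_level_loads(uncoalesced, coalesced):
    """Scale raw counters down to approximate source-level load counts."""
    return uncoalesced / UNCOALESCED_WEIGHT, coalesced / COALESCED_WEIGHT


unc, coa = source_level_loads(GLD_UNCOALESCED, GLD_COALESCED)
print("uncoalesced loads ~", round(unc))
print("coalesced loads   ~", round(coa))
print("total             ~", round(unc + coa))
print("remainders:", GLD_UNCOALESCED % UNCOALESCED_WEIGHT,
      GLD_COALESCED % COALESCED_WEIGHT)
```

That comes to roughly 118.7k uncoalesced plus 2.9k coalesced loads, about 121.6k in total. Neither counter divides evenly (the remainders are 126 and 6), so the weights can only be approximate for this kernel.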

Would you mind illustrating this part? The uncoalesced ld&st number in my profiler result is not divisible by 512… Am I understanding you correctly?

Hmm, I’d guess it wouldn’t be if you’ve got divergent threads. To be honest, I haven’t investigated this deeply, but that’s what I deduced from using the profiler.

The uncoalesced reads are counted multiple times compared to coalesced ones (once for each actual memory access they generate, not once for each uncoalesced “read.” We use the word in the singular, but the profiler doesn’t.) And both uncoalesced and coalesced reads are also counted multiple times per line of code (again, the profiler is somehow counting physical reads, not “logical” ones). But if you have divergent threads and, say, only one thread in the whole block is performing the uncoalesced read, then the counters would increment by one. This is what I’m speculating, anyway. If you find out anything more concrete, please share it.