How can I identify where coalescing can be done?

Now that I have the profiler working after adding -noprompt as an argument, I can see what’s going on. The profiler is giving me some results regarding coalesced and uncoalesced loads.

For example, in method kernel2 the totals are:

    gld uncoalesced = 60762750
    gld coalesced   = 93318

This does not look good.

How can I identify where coalescing can be done? Is there a general rule or set of rules I can apply?

How will using texture and/or shared memory affect this?

From what I read in the programming guide, it has something to do with how your kernel accesses shared memory: in order to coalesce, the threads need to read/write memory in a consecutive fashion (i.e. thread 1 writes to memory location 0x1000, thread 2 writes to 0x1004, thread 3 writes to 0x1008, and so on).
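A minimal sketch of that pattern (the kernel names, array arguments, and stride are made up for illustration):

    // Within a half-warp, consecutive threads touch consecutive elements,
    // so the addresses are contiguous and the hardware can coalesce them.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Here consecutive threads touch addresses 'stride' elements apart, so
    // the half-warp's loads can't be combined and each thread gets its own
    // memory transaction (this is what shows up as gld uncoalesced).
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }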

Another thing to check for is whether different threads are trying to write to the same memory location. That causes the accesses to serialize, since thread 2 has to wait for thread 1 to write, and so on.

Coalescing is about global memory, not shared.

On shared memory you’ve gotta watch out for bank conflicts instead.
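A quick hypothetical illustration of the difference, assuming 16 banks (as on compute 1.x parts) and a block of 16 threads; the kernel and array names are made up, and in is assumed to hold at least 256 floats:

    __global__ void bank_conflict_demo(const float *in, float *out)
    {
        __shared__ float buf[16 * 16];
        int t = threadIdx.x;    // assume a block of 16 threads

        // Fill the tile; for each j the 16 threads write consecutive
        // addresses, so the writes hit 16 different banks (no conflict).
        for (int j = 0; j < 16; ++j)
            buf[t + j * 16] = in[t + j * 16];
        __syncthreads();

        // Stride-1 read: thread t reads buf[t]; consecutive 32-bit words
        // live in consecutive banks, so the 16 reads hit 16 different banks.
        float a = buf[t];

        // Stride-16 read: thread t reads buf[t * 16]; with 16 banks every
        // thread hits bank 0, so the 16 reads are serialized.
        float b = buf[t * 16];

        out[t] = a + b;
    }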

Thanks for the correction…I’m still new to this ;)

Look at the Programming Guide section 5.1.2 for the coalescing description. The Supercomputing 07 slides on performance (available on the CUDA U site, off of CudaZone) give examples as well as performance numbers for G80 (compute capability 1.1 and lower). GT200 and subsequent GPUs help with coalescing, especially in cases of misaligned or permuted access by threads.

Paulius
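For a concrete picture of the misaligned case (the kernel name and offset parameter are made up): it is the same consecutive pattern as before, just shifted, which is enough to break coalescing on compute 1.0/1.1 but is serviced with a couple of memory transactions on GT200 and later.

    __global__ void copy_offset(const float *in, float *out, int n, int offset)
    {
        // Consecutive threads still touch consecutive elements, but the
        // first thread of each half-warp no longer starts on an aligned
        // segment boundary.
        int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
        if (i < n)
            out[i] = in[i];
    }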

Ugh, that’s sort of a tough question. The main method is “sheer force of intellect”: learn what coalescing is, read your source code, and predict where you need it. The second method is to comment out parts of your code to zero in on the problem, but that’s seriously hampered by the compiler optimizing other code out once it sees you’re no longer using its results. It’s still a useful method, however.
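One caveat with the comment-out approach, shown here on a hypothetical cut-down version of kernel2: keep a dummy write of the value whose load you’re checking, otherwise the compiler sees the result is unused and removes the very load you wanted to measure.

    __global__ void kernel2_probe(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        float v = in[i];    // the load whose coalescing you want to check

        // ...rest of the kernel commented out while narrowing things down...

        out[i] = v;         // dummy write so the compiler keeps the load
    }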

It’d be great if the profiler could produce line-by-line results.

P.S. The uncoalesced figure counts each uncoalesced access in your code as 512 (16x32). The coalesced figure counts each access as 32. Divide by those numbers respectively to get source-code-level figures. E.g., your code performs about 120k loads, correct?
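Taking those 512 and 32 factors at face value and plugging in the numbers from the first post (rough arithmetic, so treat it as an estimate):

    60762750 / 512 ≈ 118,677  (uncoalesced, at the source level)
       93318 / 32  ≈   2,916  (coalesced)

which comes to roughly 120k loads in total.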

Would you mind illustrating this part? The uncoalesced ld&st number in my profiler results is not divisible by 512… am I understanding you correctly?

Hmm, I’d guess it wouldn’t be if you’ve got divergent threads. To be honest I haven’t investigated this deeply, but that’s what I deduced from using the profiler.

The uncoalesced reads are counted multiple times compared to coalesced ones (once for each actual memory transaction they generate, not once per uncoalesced “read”; we use the word in the singular, but the profiler doesn’t). And both uncoalesced and coalesced reads are also counted multiple times per line of code (again, the profiler is apparently counting physical reads, not “logical” ones). But if you have divergent threads and, say, only one thread in the whole block is performing the uncoalesced read, then the counter would increment by just one. That’s what I’m speculating, anyway. If you find out anything more concrete, please share it.