coalesced vs. uncoalesced access why not speed-up of 16x?

I have an algorithm that used to have 100 % uncoalesced accesses. After reading the Performance Chapter in the programming guide, I was able to turn that into 100 % coalesced accesses.

However, due to following statement in the Programming Guide (for CC 1.1), I would have assumed that I could expect a speed-up of 16x (provided that my algorithm is purely bandwidth-limited, which seems to be the case).

Therefore, I had 16 different memory accesses when the program was still uncoalesced. Now that it is coalesced, the number of memory accesses should have been reduced by a factor of 16.

However, in practice the execution time only decreased by about 33 %. Since the memory loads take around 400 cycles each, I assumed that this is the only dominant part. I wonder how one can explain those 33 %? What did I miss?

btw: for these measurements, I only launched a single block with 32 threads. although that can of course not give any good performance, it should at least have reflected the speedup of 16x - or that was my assumption until now at least. (Later on I tried different execution configurations, with roughly the same “lame” speedup).

any feedback greatly appreciated!

Might be due to memory latency and not hiding it. Might be the amount of data you’re reading per thread.

The speedup should have been major. Run the visual profiler to see if you’ve really made the switch from uncoalesced to coalesced.

well that was my assumption as well. as I said, when running <<< 1 block, 32 threads >>>, it should have meant 32 single loads (32 x 128? bytes) per instruction (suppose I am reading one float2 element = 64bit-word per thread). After coalescing, it should only be 2 loads (2 x 128 bytes).

My profiling output UNCOALESCED shows:

method=[ __globfunc__KernelName_V2P6float2iPf ] <b>gputime=[ 61746.176 ]</b> cputime=[ 61764.000 ] occupancy=[ 0.250 ] <b>gld_incoherent=[ 8089600 ] gld_coherent=[ 0 ] gst_incoherent=[ 5460480 ] gst_coherent=[ 0 ]</b>

and after my changes it shows:

method=[ __globfunc__KernelName_V6P6float2iPf ] <b>gputime=[ 39475.070 ]</b> cputime=[ 39493.000 ] occupancy=[ 0.292 ] <b>gld_incoherent=[ 0 ] gld_coherent=[ 499438 ] gst_incoherent=[ 0 ] gst_coherent=[ 1365120 ]</b>

–> speedup = 61746 / 39475 = 1,56x

btw: is there a way to activate the visual profiler on Linux, too? and can the visual profiler provide additional information (or show more than 4 performance criterions at once?)



On windows xp, yeah, the visual profiler can show all the counters at once. I don’t think it exposes any counters that don’t exist on Linux.

Now, I can understand this result for just 32 threads and 1 block. Latency will be the predominant factor by far. Are you sure you get the same result running something serious, like 256 threads and 16 blocks? If you do, your assumption that you’re bandwidth-limited must be wrong. Perhaps other sorts of conflicts (ie bank or cmem) are ballooning the duration of your other instructions.

Btw, what’s your GPU?

And here’s another thought… are you sure the coalesced version is correct? From your profiler output, you can see the global loads are down 16x, but the global stores are down only 4x.

ok my fault…when launching blocks with > 32 threads, such as your suggested << 16, 256 >> config, I get a speedup of about 10x. see this profiler output.


method=[ __globfunc__Z29CholeskyOnDevice_LowerDiag_V2P6float2iPf ] gputime=[ 2119992.750 ] cputime=[ 2119991.000 ] occupancy=[ 0.333 ] gld_incoherent=[ 129433600 ] gld_coherent=[ 0 ] gst_incoherent=[ 87367680 ] gst_coherent=[ 0 ]


method=[ __globfunc__Z29CholeskyOnDevice_LowerDiag_V6P6float2iPf ] gputime=[ 210391.016 ] cputime=[ 210408.000 ] occupancy=[ 0.333 ] gld_incoherent=[ 0 ] gld_coherent=[ 7991008 ] gst_incoherent=[ 0 ] gst_coherent=[ 21841920 ]

still I would have assumed that even when only launching one block, 32 threads, I should have seen 16x speedup (since there are not enough threads to hide any latency, therefore 16 x 400 cycles per load should be much slower than 1 x 400 cycle per load!??)

btw: I’m not using any other type of memory in this kernel version right now (which is poor performance, I know), therefore I can’t have any smem bank or cmem conflicts.

My GPU currently in use is the GeForce 8800GTX - but I’ve just received my beloved GTX280, so I’ll be able to switch to that new card soon…

hm. that fact puzzles me as well. I have no explanation for this.

maybe it helps if I outline the structure of my code:

__global__ void MyKernel_V2(cuComplex* a, int n, float* diag)


	int tx = threadIdx.x;

	int bx = blockIdx.x;


	// temp variables:

	int i, j, k;

	int offsetA, offsetB, offsetC;

	cuComplex tempOp1;

	for(i=0; i < n; i++)



		while(j < n)



			while(k < i)


				// compute offsetA, offsetB, offsetC - they are dependent on bx, i, j, k

				tempOp1 = cuCmulf(cuConjf(a[offsetA + tx]), a[offsetB + tx]);

				tempOp1 = cuCmulf(tempOp1, make_cuFloatComplex(diag[tx], 0));

				a[offsetC + tx] = cuCsubf(a[offsetC + tx], tempOp1);



			a[offsetC + tx] = cuCdivf(a[offsetC + tx], make_cuFloatComplex(diag[tx], 0));





Note: I have “anonymized” the computation of offsetA, B and C, i.a. for better readibility. However, all memory accesses should be consequent/coalesced, since all are accessed with the pattern “offset + tx” - correct?

The input matrix “a” is by the way a 1-dimensional array of cuComplex (= float2, 8 bytes each), and has a dimension of numOfBlocks x numOfThreads x 80 x 80.

So basically what it does per thread is it does a whole bunch of serial computations (therefore the nested for loop). However, one thread only operates on a small 80x80 matrix. Every 32 threads then execute the same piece of code (zero divergence) on a separate 80x80 matrix, with the ability of reading adjacent memory.

Since every thread handles a different section of the big input matrix “a”, no further synchronization should be necessary.


–> what are you suggesting then - would breaking down the thread into chunks that handle smaller data help??

further question:

  • can somebody explain to me, how the above code will be executed? I mean, does every thread handle the same instruction at the same time (since we only have one instruction cache), or does that only hold true within a block (but not across different blocks)?

  • so is it possible, that warp 1 executes the first multiplication in line 1 of the inner loop, whereas another warp already executes the subtraction in line 3 of the inner loop? or can I assume every (active) warp is executing the same instruction?

thanks a lot so far,


The launch overhead will most certainly dominate any timing measurement of a 1-block grid.

In general, I would agree. But since in my case a single kernel run for one thread takes already 40ms (which is very long for a thread I suppose), I think kernel overhead can no longer be a dominant factor!?

Btw: the same code on the CPU takes only 0.4 ms to execute, which is 100x faster! I assume this is both due to the optimizer capabilities of gcc4.1.2 (option “-O3”, which activates the SIMD capabilities on my quad core CPU), as well as the big cache which reduces memory accesses significantly (and I have a lot such accesses in my code…)!? Any comments on that?



Ok, you’ve got a point. I missed the point where your kernel was so enormously long.

This is expected behavior. The GPU is built from a lot of relatively slow processors. You are effectively comparing one of those slow processors to a really fast one. You don’t start getting full efficiency out of the GPU until you have 10,000+ threads running all interleaving computation and data access to hide latencies. And the CPU cache certainly has something to do with it.

As to why your long 1 warp run isn’t 16x faster with coalescing, I don’t know. Coalescing isn’t really about avoiding latency as it is about accessing the full bandwidth potential of the device. If you aren’t able to reach the full bandwidth (70 GiB/s on 8800 GTX or 110 GiB/s on GTX 280), then coalescing probably won’t make much of a difference.

How much the penalty is all comes down to the details of the memory controllers which are not documented in full detail anywhere. One of the papers NVIDIA has published on the CUDA architectures mentions an advanced dynamic memory controller that can route and schedule the hundreds of requests that come in from the various multiprocessors. Perhaps in your uncoalesced case, that controller sees some unused memory lanes and uses them to do more than one uncoalesced accesses in parallel. I just don’t know.

ok. your statement actually makes a lot of sense, and helped me understand the implementation details much better - thanks a lot!!

–> if “coalescing is not really about avoiding latency” as you pointed out, what I got wrong then was probably my assumption that 16 separate memory transactions would take 16 x ~400 cycles, which apparently is not necessarily the case.



Uncoalesced reads don’t hit latency like that. They increase it only by a little bit. (The latency is mostly spent traveling through the GPU, through the “advanced dynamic crossbar” or whatnot. Maybe the uncoalesced read travels through the chip in a bundle, or else it is simply pipelined. In either case the read makes the journey to the memory controller in about the same time, whether coalesced or uncoalesced. There the multiple transactions finally take their toll and begin to add time, since DRAM accesses are serial, but not too much relative to the 400 cycles already spent. But for bandwidth it’s a different story. When the bandwidth wall is hit with many threads active and each hiding the other’s latency, the memory controller is busy full time, and the focus is shifted completely away from journey time. Any extra transactions are exactly that: extra burden.)

Within a warp or half-warp, actually.

That makes a lot of sense now.

Thanks for helping me out everybody!