coalesced vs. uncoalesced access why not speed-up of 16x?

TheLetti · October 28, 2008, 12:12am

I have an algorithm that used to have 100 % uncoalesced accesses. After reading the Performance Chapter in the programming guide, I was able to turn that into 100 % coalesced accesses.

However, due to following statement in the Programming Guide (for CC 1.1), I would have assumed that I could expect a speed-up of 16x (provided that my algorithm is purely bandwidth-limited, which seems to be the case).

Therefore, I had 16 different memory accesses when the program was still uncoalesced. Now that it is coalesced, the number of memory accesses should have been reduced by a factor of 16.

However, in practice the execution time only decreased by about 33 %. Since the memory loads take around 400 cycles each, I assumed that this is the only dominant part. I wonder how one can explain those 33 %? What did I miss?

btw: for these measurements, I only launched a single block with 32 threads. although that can of course not give any good performance, it should at least have reflected the speedup of 16x - or that was my assumption until now at least. (Later on I tried different execution configurations, with roughly the same “lame” speedup).

any feedback greatly appreciated!

tmurray · October 28, 2008, 12:28am

Might be due to memory latency and not hiding it. Might be the amount of data you’re reading per thread.

alex_dubinsky · October 28, 2008, 12:44am

The speedup should have been major. Run the visual profiler to see if you’ve really made the switch from uncoalesced to coalesced.

TheLetti · October 28, 2008, 12:59am

well that was my assumption as well. as I said, when running <<< 1 block, 32 threads >>>, it should have meant 32 single loads (32 x 128? bytes) per instruction (suppose I am reading one float2 element = 64bit-word per thread). After coalescing, it should only be 2 loads (2 x 128 bytes).

My profiling output UNCOALESCED shows:

method=[ __globfunc__KernelName_V2P6float2iPf ] <b>gputime=[ 61746.176 ]</b> cputime=[ 61764.000 ] occupancy=[ 0.250 ] <b>gld_incoherent=[ 8089600 ] gld_coherent=[ 0 ] gst_incoherent=[ 5460480 ] gst_coherent=[ 0 ]</b>

and after my changes it shows:

method=[ __globfunc__KernelName_V6P6float2iPf ] <b>gputime=[ 39475.070 ]</b> cputime=[ 39493.000 ] occupancy=[ 0.292 ] <b>gld_incoherent=[ 0 ] gld_coherent=[ 499438 ] gst_incoherent=[ 0 ] gst_coherent=[ 1365120 ]</b>

→ speedup = 61746 / 39475 = 1,56x

btw: is there a way to activate the visual profiler on Linux, too? and can the visual profiler provide additional information (or show more than 4 performance criterions at once?)

thanks,

Michael

alex_dubinsky · October 28, 2008, 2:06am

On windows xp, yeah, the visual profiler can show all the counters at once. I don’t think it exposes any counters that don’t exist on Linux.

Now, I can understand this result for just 32 threads and 1 block. Latency will be the predominant factor by far. Are you sure you get the same result running something serious, like 256 threads and 16 blocks? If you do, your assumption that you’re bandwidth-limited must be wrong. Perhaps other sorts of conflicts (ie bank or cmem) are ballooning the duration of your other instructions.

Btw, what’s your GPU?

alex_dubinsky · October 28, 2008, 2:20am

And here’s another thought… are you sure the coalesced version is correct? From your profiler output, you can see the global loads are down 16x, but the global stores are down only 4x.

TheLetti · October 28, 2008, 3:27am

ok my fault…when launching blocks with > 32 threads, such as your suggested << 16, 256 >> config, I get a speedup of about 10x. see this profiler output.

uncoalesced:

method=[ __globfunc__Z29CholeskyOnDevice_LowerDiag_V2P6float2iPf ] gputime=[ 2119992.750 ] cputime=[ 2119991.000 ] occupancy=[ 0.333 ] gld_incoherent=[ 129433600 ] gld_coherent=[ 0 ] gst_incoherent=[ 87367680 ] gst_coherent=[ 0 ]

coalesced:

method=[ __globfunc__Z29CholeskyOnDevice_LowerDiag_V6P6float2iPf ] gputime=[ 210391.016 ] cputime=[ 210408.000 ] occupancy=[ 0.333 ] gld_incoherent=[ 0 ] gld_coherent=[ 7991008 ] gst_incoherent=[ 0 ] gst_coherent=[ 21841920 ]

still I would have assumed that even when only launching one block, 32 threads, I should have seen 16x speedup (since there are not enough threads to hide any latency, therefore 16 x 400 cycles per load should be much slower than 1 x 400 cycle per load!??)

btw: I’m not using any other type of memory in this kernel version right now (which is poor performance, I know), therefore I can’t have any smem bank or cmem conflicts.

My GPU currently in use is the GeForce 8800GTX - but I’ve just received my beloved GTX280, so I’ll be able to switch to that new card soon…

TheLetti · October 28, 2008, 5:12am

hm. that fact puzzles me as well. I have no explanation for this.

maybe it helps if I outline the structure of my code:

__global__ void MyKernel_V2(cuComplex* a, int n, float* diag)

{

	int tx = threadIdx.x;

	int bx = blockIdx.x;

	

	// temp variables:

	int i, j, k;

	int offsetA, offsetB, offsetC;

	cuComplex tempOp1;

	for(i=0; i < n; i++)

	{

		j=i+1;

		while(j < n)

		{

			k=0;

			while(k < i)

			{

				// compute offsetA, offsetB, offsetC - they are dependent on bx, i, j, k

				tempOp1 = cuCmulf(cuConjf(a[offsetA + tx]), a[offsetB + tx]);

				tempOp1 = cuCmulf(tempOp1, make_cuFloatComplex(diag[tx], 0));

				a[offsetC + tx] = cuCsubf(a[offsetC + tx], tempOp1);

				k++;	

			}

			a[offsetC + tx] = cuCdivf(a[offsetC + tx], make_cuFloatComplex(diag[tx], 0));

			j++;

		}

	}

}

Note: I have “anonymized” the computation of offsetA, B and C, i.a. for better readibility. However, all memory accesses should be consequent/coalesced, since all are accessed with the pattern “offset + tx” - correct?

The input matrix “a” is by the way a 1-dimensional array of cuComplex (= float2, 8 bytes each), and has a dimension of numOfBlocks x numOfThreads x 80 x 80.

So basically what it does per thread is it does a whole bunch of serial computations (therefore the nested for loop). However, one thread only operates on a small 80x80 matrix. Every 32 threads then execute the same piece of code (zero divergence) on a separate 80x80 matrix, with the ability of reading adjacent memory.

Since every thread handles a different section of the big input matrix “a”, no further synchronization should be necessary.

@tmurray:

→ what are you suggesting then - would breaking down the thread into chunks that handle smaller data help??

further question:

can somebody explain to me, how the above code will be executed? I mean, does every thread handle the same instruction at the same time (since we only have one instruction cache), or does that only hold true within a block (but not across different blocks)?
so is it possible, that warp 1 executes the first multiplication in line 1 of the inner loop, whereas another warp already executes the subtraction in line 3 of the inner loop? or can I assume every (active) warp is executing the same instruction?

thanks a lot so far,

michael

MisterAnderson42 · October 28, 2008, 12:26pm

The launch overhead will most certainly dominate any timing measurement of a 1-block grid.

TheLetti · October 28, 2008, 4:17pm

In general, I would agree. But since in my case a single kernel run for one thread takes already 40ms (which is very long for a thread I suppose), I think kernel overhead can no longer be a dominant factor!?

Btw: the same code on the CPU takes only 0.4 ms to execute, which is 100x faster! I assume this is both due to the optimizer capabilities of gcc4.1.2 (option “-O3”, which activates the SIMD capabilities on my quad core CPU), as well as the big cache which reduces memory accesses significantly (and I have a lot such accesses in my code…)!? Any comments on that?

Thanks,

Michael

MisterAnderson42 · October 28, 2008, 5:35pm

Ok, you’ve got a point. I missed the point where your kernel was so enormously long.

This is expected behavior. The GPU is built from a lot of relatively slow processors. You are effectively comparing one of those slow processors to a really fast one. You don’t start getting full efficiency out of the GPU until you have 10,000+ threads running all interleaving computation and data access to hide latencies. And the CPU cache certainly has something to do with it.

As to why your long 1 warp run isn’t 16x faster with coalescing, I don’t know. Coalescing isn’t really about avoiding latency as it is about accessing the full bandwidth potential of the device. If you aren’t able to reach the full bandwidth (70 GiB/s on 8800 GTX or 110 GiB/s on GTX 280), then coalescing probably won’t make much of a difference.

How much the penalty is all comes down to the details of the memory controllers which are not documented in full detail anywhere. One of the papers NVIDIA has published on the CUDA architectures mentions an advanced dynamic memory controller that can route and schedule the hundreds of requests that come in from the various multiprocessors. Perhaps in your uncoalesced case, that controller sees some unused memory lanes and uses them to do more than one uncoalesced accesses in parallel. I just don’t know.

TheLetti · October 29, 2008, 3:17am

This is expected behavior. The GPU is built from a lot of relatively slow processors. You are effectively comparing one of those slow processors to a really fast one. You don’t start getting full efficiency out of the GPU until you have 10,000+ threads running all interleaving computation and data access to hide latencies. And the CPU cache certainly has something to do with it.

As to why your long 1 warp run isn’t 16x faster with coalescing, I don’t know. Coalescing isn’t really about avoiding latency as it is about accessing the full bandwidth potential of the device. If you aren’t able to reach the full bandwidth (70 GiB/s on 8800 GTX or 110 GiB/s on GTX 280), then coalescing probably won’t make much of a difference.

How much the penalty is all comes down to the details of the memory controllers which are not documented in full detail anywhere. One of the papers NVIDIA has published on the CUDA architectures mentions an advanced dynamic memory controller that can route and schedule the hundreds of requests that come in from the various multiprocessors. Perhaps in your uncoalesced case, that controller sees some unused memory lanes and uses them to do more than one uncoalesced accesses in parallel. I just don’t know.

ok. your statement actually makes a lot of sense, and helped me understand the implementation details much better - thanks a lot!!

→ if “coalescing is not really about avoiding latency” as you pointed out, what I got wrong then was probably my assumption that 16 separate memory transactions would take 16 x ~400 cycles, which apparently is not necessarily the case.

thanks,

michael

alex_dubinsky · October 29, 2008, 3:57am

Uncoalesced reads don’t hit latency like that. They increase it only by a little bit. (The latency is mostly spent traveling through the GPU, through the “advanced dynamic crossbar” or whatnot. Maybe the uncoalesced read travels through the chip in a bundle, or else it is simply pipelined. In either case the read makes the journey to the memory controller in about the same time, whether coalesced or uncoalesced. There the multiple transactions finally take their toll and begin to add time, since DRAM accesses are serial, but not too much relative to the 400 cycles already spent. But for bandwidth it’s a different story. When the bandwidth wall is hit with many threads active and each hiding the other’s latency, the memory controller is busy full time, and the focus is shifted completely away from journey time. Any extra transactions are exactly that: extra burden.)

Within a warp or half-warp, actually.

TheLetti · October 29, 2008, 8:31pm

Uncoalesced reads don’t hit latency like that. They increase it only by a little bit. (The latency is mostly spent traveling through the GPU, through the “advanced dynamic crossbar” or whatnot. Maybe the uncoalesced read travels through the chip in a bundle, or else it is simply pipelined. In either case the read makes the journey to the memory controller in about the same time, whether coalesced or uncoalesced. There the multiple transactions finally take their toll and begin to add time, since DRAM accesses are serial, but not too much relative to the 400 cycles already spent. But for bandwidth it’s a different story. When the bandwidth wall is hit with many threads active and each hiding the other’s latency, the memory controller is busy full time, and the focus is shifted completely away from journey time. Any extra transactions are exactly that: extra burden.)

Within a warp or half-warp, actually.

That makes a lot of sense now.

Thanks for helping me out everybody!

Topic		Replies	Views
Coalesced Memory access related doubt CUDA Programming and Performance	13	1961	December 9, 2010
Local memory performance Using more than 4kb kills it.. why? CUDA Programming and Performance	24	5073	September 6, 2008
Putting the GPU at work CUDA Programming and Performance	21	20172	July 5, 2007
Coalesced vs non-coalesced in reduction example Why float4-reads are not coalesced? CUDA Programming and Performance	10	4035	October 15, 2008
why 256byte loads slower than 128byte loads? CUDA Programming and Performance	6	6928	February 11, 2010
CUDA Use Cases run serial algorithms on composite data CUDA Programming and Performance	14	4487	October 24, 2008
Using Shared Memory in CUDA C/C++ Technical Blog	36	1946	October 8, 2020
Transfer-Bound Application Looking for ideas to speed it up CUDA Programming and Performance	36	29320	April 23, 2010
Uncoalesced global memory bandwidth CUDA Programming and Performance	3	2229	March 28, 2009
Waiting for global memory access. CUDA Programming and Performance	32	56336	January 31, 2008

coalesced vs. uncoalesced access why not speed-up of 16x?

Related topics