Uncoalesced global memory bandwidth

It is well known, and stated explicitly in the programming guide, that uncoalesced global memory access is much slower than coalesced access. The guide says that for coalesced access, 32-bit is the fastest, followed by 64-bit and then 128-bit, whereas for uncoalesced access, 128-bit is the fastest, then 64-bit, then 32-bit.

The question I have is: by how much specifically?

If 32-bit coalesced global memory access should give something close to the theoretical maximum bandwidth on a device, what fraction of that can you expect from each of the other combinations (64-bit coalesced, 128-bit coalesced, 32-bit uncoalesced, 64-bit uncoalesced, 128-bit uncoalesced)? In other words, coalesced 32-bit is 1/1 and everything else is some other fraction.

I’m aware this may depend on the actual device in question, but for an example let’s say on the G80 series 8800 GTX.

Intuitively, my guess is that for coalesced access CUDA will issue (device global memory interface width / total read size) memory transactions, each exactly the interface width in size, but each stream processor can only issue one 32-bit memory transaction per instruction, so 64-bit and 128-bit accesses, while coalesced, would run at 1/2 and 1/4 the speed of coalesced 32-bit access respectively, regardless of the device's bus width.

For uncoalesced 32-bit access, CUDA will be issuing one memory transaction per thread, which occupies only 32 bits of the bus width, so for the 384-bit interface on an 8800 GTX this would be 1/12 the speed of coalesced 32-bit access. For uncoalesced 64-bit and 128-bit access, I'm guessing CUDA attempts to issue 2 transactions per 2 threads and 4 transactions per 4 threads respectively, such that in a 64-bit access the first thread accesses the first 32-bit word while the 2nd thread simultaneously accesses the 2nd 32-bit word of its 64-bit memory operation, then vice versa in the next transaction, and likewise for a 128-bit access. This would make the fractions for uncoalesced access on an 8800 GTX 1/12 for 32-bit, 1/6 for 64-bit, and 1/3 for 128-bit.

Obviously this is just speculation, but it seems plausible that this is how it might be implemented in hardware, and it fits with CUDA's description of the relative bandwidths of the various types of global memory access. One interesting consequence is that under this model, 128-bit uncoalesced access would be faster than 128-bit coalesced access, but again, this is all speculation. Does anyone have any hard numbers here?

Hmm. I guess I haven’t read the recent docs. On compute 1.0 hardware, it used to say that 64-bit coalesced is usually a bit faster than 32-bit and much faster than 128-bit. On compute 1.1 and newer, the differences are small.

Benchmark it! I wrote some code a while back to do this: http://forums.nvidia.com/index.php?showtop…mp;#entry292058

That thread only has results for compute 1.0 and 1.1 boards. Here is the output on a GTX 285:

copy_gmem<char> - Bandwidth:	34.925729 GiB/s
copy_gmem<float> - Bandwidth:	121.494258 GiB/s
copy_gmem<float2> - Bandwidth:	126.038586 GiB/s
copy_gmem<float4> - Bandwidth:	104.040466 GiB/s
copy_tex<float> - Bandwidth:	124.938593 GiB/s
copy_tex<float2> - Bandwidth:	129.273315 GiB/s
copy_tex<float4> - Bandwidth:	130.567617 GiB/s
write_only<char> - Bandwidth:	18.899835 GiB/s
write_only<float> - Bandwidth:	73.363346 GiB/s
write_only<float2> - Bandwidth:	75.512689 GiB/s
write_only<float4> - Bandwidth:	73.769699 GiB/s
read_only_gmem<char> - Bandwidth:	13.575759 GiB/s
read_only_gmem<float> - Bandwidth:	69.645715 GiB/s
read_only_gmem<float2> - Bandwidth:	97.900049 GiB/s
read_only_gmem<float4> - Bandwidth:	52.964178 GiB/s
read_only_tex<float> - Bandwidth:	69.641488 GiB/s
read_only_tex<float2> - Bandwidth:	109.820544 GiB/s
read_only_tex<float4> - Bandwidth:	105.827061 GiB/s

Huh, so even with the new coalescing rules, chars are still roughly 4x slower than floats. This makes it sound like the minimum memory transaction size is 64 bytes, since only 16 bytes can be requested by a half-warp of char reads. Reading the programming guide would lead one to believe that 32-byte transactions are possible, but maybe something else is going on here…

Well, it would be on a 285 with a 512-bit (64-byte) interface to global memory. The reason this is interesting is that the rates and fractions change depending on that bus width, so it would be different on an 8800 GTX. The main reason I'm asking is that I've now half-implemented my application, and I'm very clearly bottlenecked by scattered, uncoalesced writes to global memory (there is definitely no possible way to coalesce these), and I'm wondering if there is any way to mitigate that.

Mine is a rendering application, and I've done the math: it looks like the performance cost of scattered uncoalesced writes in a rasterization-based approach to rendering will still beat the gain of perfectly coalesced memory access in a raytracing style of rendering, simply because the wasted instruction count is vastly lower, and no expensive, complex acceleration structures are required (which would in turn need to be updated in a dynamic scene with animated geometry).

The idea I'm currently playing with is seeing if I can use shared memory in each thread block as an intermediate framebuffer for a localized section of the screen. Writes outside this area, if they occur, would go to the global memory framebuffer, but writes inside it would be much faster, and once finished I could write the whole block out to the framebuffer coalesced. The problem, of course, is that 16 KB is tiny. Extremely tiny. In an absolute best case scenario I could use this for a 64x64 pixel region, more likely a 32x32 pixel region. Are you listening, NVIDIA? 16 KB is not enough for anybody! And while you're at it, the 32-thread warp size should come down too.