float4 bandwidth advantages over plain float1

It’s not too hard to max out the bandwidth of a device, getting pretty close to the theoretical limit, by having each thread of a warp read a single word from device memory, so that the warp as a whole reads 32 sequential words aligned to a 32-word boundary.

But why then is there a lot of code that carefully reads groups of float4 or int4 per thread to get max bandwidth, instead of one word per thread?

I think I understand the answer but the programming guide is silent, and I got most of the hints from other posts here on the forum, so I want to make sure I understand.

My impression, and I want to be corrected if wrong, is that maybe there actually IS no bandwidth improvement in reading float4 vs. single floats. The reason float4 reads are superior (when applicable) is that they use fewer instructions by queuing up 4 words per thread at once. That may not be important if the kernel is just doing a memory copy, but it IS a savings if it’s doing other compute, since it’s basically 3 free instructions saved.

Is my understanding correct? Or are there other advantages?

There used to be a sizeable performance advantage from choosing wider accesses on older GPU architectures. This was at least in part because the load/store queues could track only a fixed number of memory accesses, so with each access being wider, the total amount of memory traffic (in bytes) that could be queued up increased. The widest native access instructions on the GPU are 128-bit loads / stores, which motivates the use of int4 / float4 / double2 to maximize memory throughput.

I am reasonably certain that starting with the Maxwell generation of GPUs the performance benefit of wider accesses disappeared, due to significant changes in how the memory hierarchy was implemented in hardware. There may be a residual effect from wider accesses leading to a reduction in static and dynamic instruction count. However, this is probably counterbalanced by wider accesses tending to increase register pressure.
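
For concreteness, here is a minimal sketch of the two access styles being compared: a copy kernel using one 32-bit access per thread and one using a single 128-bit (float4) access per thread. The kernel names, the `CHECK` macro, the helper `run_copy_check`, and the launch configuration are illustrative assumptions, not code from this thread. Each variant is timed with a single, un-warmed launch, so the printed numbers are only a rough indication and will vary by GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                    cudaGetErrorString(err_), __LINE__);          \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// One 32-bit load and one 32-bit store per thread per iteration.
__global__ void copy_scalar(const float* __restrict__ in,
                            float* __restrict__ out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// One 128-bit load and one 128-bit store per thread per iteration.
// n4 is the number of float4 elements; pointers must be 16-byte aligned.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, size_t n4)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n4; i += stride)
        out[i] = in[i];
}

// Runs both variants on n floats (n must be a multiple of 4), prints a rough
// effective bandwidth for each, and returns true if both copies are correct.
bool run_copy_check(size_t n)
{
    std::vector<float> h_in(n), h_out(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 1024);

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMalloc(&d_in, bytes));   // cudaMalloc returns suitably aligned memory
    CHECK(cudaMalloc(&d_out, bytes));
    CHECK(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));

    cudaEvent_t t0, t1;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));
    const int block = 256, grid = 1024;  // assumed launch configuration
    bool ok = true;

    for (int variant = 0; variant < 2; ++variant) {
        CHECK(cudaMemset(d_out, 0, bytes));
        CHECK(cudaEventRecord(t0));
        if (variant == 0)
            copy_scalar<<<grid, block>>>(d_in, d_out, n);
        else
            copy_vec4<<<grid, block>>>(reinterpret_cast<const float4*>(d_in),
                                       reinterpret_cast<float4*>(d_out), n / 4);
        CHECK(cudaEventRecord(t1));
        CHECK(cudaEventSynchronize(t1));
        CHECK(cudaGetLastError());

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, t0, t1));
        CHECK(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
        for (size_t i = 0; i < n; ++i)
            if (h_out[i] != h_in[i]) { ok = false; break; }

        // Bytes moved = bytes read + bytes written.
        printf("%s: %.3f ms, %.1f GB/s\n", variant == 0 ? "float " : "float4",
               ms, 2.0 * bytes / (ms * 1.0e6));
    }

    CHECK(cudaEventDestroy(t0));
    CHECK(cudaEventDestroy(t1));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    return ok;
}

int main()
{
    bool ok = run_copy_check((size_t)1 << 24);  // 16M floats, 64 MB per buffer
    printf("%s\n", ok ? "copies match" : "MISMATCH");
    return ok ? 0 : 1;
}
```

For a meaningful comparison one would add warm-up launches and average over many repetitions, as STREAM-type benchmarks do.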

^what he said

My $0.02:

I think you have a pretty good grasp of it.

When all threads are just banging on the memory subsystem, as you’ve already indicated it takes minimal programming effort to saturate things. But there may be other situations where you only have certain threads reading at a particular time, or possibly some other scenarios, where the efficiency benefit is useful.

To expand on this a bit, one of the figures of merit for a CUDA program is how much parallel work is exposed (by the programmer, and subsequently by the compiler). This has to do with the instantaneous thread-carrying capacity of a particular GPU, for which we can use 2048 * number of SMs as a proxy (2048 being the maximum number of resident threads per SM on many GPU generations). To maximize the amount of memory traffic and general parallel work per thread, it may be useful to do vector loads/stores. Something similar can often be achieved with careful arrangement of loops to permit unrolling/reordering by the compiler, but that is sometimes easier said than done.
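
As a rough illustration of the loop-arrangement alternative, the sketch below has each thread issue four independent, still-coalesced scalar loads before using any of them, so several memory requests per thread can be in flight at once. The kernel name `scale_ilp4`, the helper `run_scale_check`, and the launch configuration are assumptions made for this example only; whether this helps on a given GPU would need to be measured.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles four elements per iteration, spaced one grid-stride
// apart, so every individual load is still coalesced across the warp.
__global__ void scale_ilp4(const float* __restrict__ in,
                           float* __restrict__ out, size_t n, float a)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

    for (; i + 3 * stride < n; i += 4 * stride) {
        // Four independent loads issued back to back, before any use.
        float v0 = in[i];
        float v1 = in[i + stride];
        float v2 = in[i + 2 * stride];
        float v3 = in[i + 3 * stride];
        out[i]              = a * v0;
        out[i + stride]     = a * v1;
        out[i + 2 * stride] = a * v2;
        out[i + 3 * stride] = a * v3;
    }
    // Tail: remaining elements owned by this thread, one at a time.
    for (; i < n; i += stride)
        out[i] = a * in[i];
}

// Runs the kernel on n floats and returns true if every output is correct.
bool run_scale_check(size_t n)
{
    std::vector<float> h_in(n), h_out(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 100);

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_ilp4<<<256, 256>>>(d_in, d_out, n, 2.0f);  // assumed launch configuration
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_out.data(), d_out, bytes,
                           cudaMemcpyDeviceToHost) == cudaSuccess);

    // Inputs are small integers, so 2.0f * x is exact in float.
    for (size_t i = 0; ok && i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ok = false;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    // Deliberately not a multiple of the grid size, to exercise the tail loop.
    bool ok = run_scale_check(1000003);
    printf("%s\n", ok ? "results match" : "MISMATCH");
    return ok ? 0 : 1;
}
```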

Another possible benefit (not directly in the scope of your question, perhaps) is that it allows a limited amount of AoS activity in a machine that otherwise strongly prefers SoA memory organization.

A “typical” recommendation for efficient CUDA programming is to reorganize AoS storage patterns into SoA storage patterns. This is such a common topic that I’ll skip repeating it here. You can easily google for writeups, or here is one:

https://stackoverflow.com/questions/42451832/cuda-profiler-reports-inefficient-global-memory-access/42451933#42451933

However, in the case where your structure can be organized to fit into 16 bytes or 8 bytes per thread, rather than doing a full reorganization to SoA-type storage, you can use a vector load/store to achieve perfect coalescing/full efficiency.
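
A minimal sketch of that idea, assuming a hypothetical 16-byte structure `Particle` (the struct, its fields, the kernel `weighted_norm2`, and the helper `run_aos_check` are all made up for illustration): each thread fetches its whole structure with one 128-bit load, so a warp reads 32 consecutive 16-byte elements.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical AoS element: exactly 16 bytes and 16-byte aligned,
// so it has the same size and alignment as a float4.
struct __align__(16) Particle {
    float x, y, z, mass;
};
static_assert(sizeof(Particle) == sizeof(float4), "Particle must be 16 bytes");

__global__ void weighted_norm2(const Particle* __restrict__ p,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Read the whole struct as a single float4 (one 128-bit load).
        float4 v = reinterpret_cast<const float4*>(p)[i];
        out[i] = v.w * (v.x * v.x + v.y * v.y + v.z * v.z);
    }
}

// Runs the kernel on n particles and returns true if every output is correct.
bool run_aos_check(int n)
{
    std::vector<Particle> h_p(n);
    std::vector<float> h_out(n);
    for (int i = 0; i < n; ++i) {
        h_p[i].x = 1.0f; h_p[i].y = 2.0f; h_p[i].z = 2.0f;  // x^2 + y^2 + z^2 = 9
        h_p[i].mass = (float)(i % 7);
    }

    Particle* d_p = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_p, n * sizeof(Particle)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_p); return false; }
    cudaMemcpy(d_p, h_p.data(), n * sizeof(Particle), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;
    weighted_norm2<<<grid, block>>>(d_p, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                           cudaMemcpyDeviceToHost) == cudaSuccess);

    // All values are small integers, so the expected result is exact.
    for (int i = 0; ok && i < n; ++i)
        if (h_out[i] != 9.0f * (float)(i % 7)) ok = false;

    cudaFree(d_p);
    cudaFree(d_out);
    return ok;
}

int main()
{
    bool ok = run_aos_check(100000);
    printf("%s\n", ok ? "results match" : "MISMATCH");
    return ok ? 0 : 1;
}
```

The explicit float4 cast just makes the intent visible. With the 16-byte alignment declared on the struct, the compiler may well generate the same single 128-bit load from a plain `Particle q = p[i];`, which is worth confirming in the generated SASS.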

Thanks for the background! I never expected it to be a historical effect, but the explanation makes sense.

Is there any rough idea how much impact it has on the bandwidth for Kepler devices? I.e., is it like a factor of 2x in practical bandwidth? Or kind of a 10% code-polishing effect only attractive to ninja perfectionists?

The dark recesses of my memory want to say that memory throughput as measured by STREAM-type benchmarks differed by up to ~ 15% when comparing access via 32-bit, 64-bit, and 128-bit loads.

Note that on those older architectures, accessing memory in chunks smaller than 32 bits could reduce memory throughput significantly. I am reasonably sure that on modern architectures that, too, is no longer the case.

njuffa, thanks for that ballpark estimate. That means there’s no need to be concerned about losing a lot of performance by using simpler word reads, especially on the last 3 generations of GPUs.

I would concur, with the possible exception of ninja-level optimizations in light of txbob’s remarks in #3.

The positive take-home message is that recent GPU architectures have made life quite a bit easier for CUDA programmers when it comes to performance optimization. It is much easier to get performance out of the hardware, and the CUDA profiler has improved quite a bit as well.

Now, if the hardware architects could fix the limited number of bits in the performance counter registers, we might be in fat city :-)