float4 bandwidth advantages over plain float1

nvredsox · July 2, 2018, 12:15am

It’s not too hard to get pretty close to a device’s theoretical bandwidth limit by having each thread of every warp read a single sequential 32-bit word from device memory, with each warp’s access aligned to a nice 32-word boundary.

But why then is there a lot of code that carefully reads groups of float4 or int4 per thread to get max bandwidth, instead of one word per thread?

I think I understand the answer but the programming guide is silent, and I got most of the hints from other posts here on the forum, so I want to make sure I understand.

My impression, and I want to be corrected if wrong, is that maybe there actually IS no bandwidth improvement in reading float4 vs single floats. The reason float4 reads are superior (when applicable) is that they use fewer instructions by queuing up 4 words per thread at once. That may not be important if the kernel is just doing a memory copy, but it IS a saving if it’s doing other compute, since it’s basically 3 load instructions saved for free.

Is my understanding correct? Or are there other advantages?
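For concreteness, here is a minimal sketch of the two access patterns being compared. The kernel names and the host helper `copies_match` are made up for illustration, and the sketch assumes `n` is a multiple of 4 and that the buffers come straight from `cudaMalloc` (so they are suitably aligned for `float4`).

```cuda
#include <cuda_runtime.h>
#include <vector>

// One 32-bit load and one 32-bit store per thread.
__global__ void copy_scalar(const float* __restrict__ in, float* __restrict__ out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// One 128-bit load and one 128-bit store per thread; n4 counts float4 elements.
__global__ void copy_vec4(const float4* __restrict__ in, float4* __restrict__ out, size_t n4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Hypothetical helper: runs both kernels on n floats and checks that
// each output equals the input. Assumes n > 0 and n % 4 == 0.
bool copies_match(size_t n)
{
    if (n == 0 || n % 4 != 0) return false;
    std::vector<float> h_in(n), h_a(n, -1.0f), h_b(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 1024);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    const size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const unsigned block = 256;
        const size_t n4 = n / 4;
        copy_scalar<<<(unsigned)((n + block - 1) / block), block>>>(d_in, d_a, n);
        copy_vec4<<<(unsigned)((n4 + block - 1) / block), block>>>(
            reinterpret_cast<const float4*>(d_in), reinterpret_cast<float4*>(d_b), n4);
        // cudaMemcpy synchronizes, so kernel failures surface here.
        ok = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return ok && h_a == h_in && h_b == h_in;
}
```

Both kernels move the same bytes; the float4 version just launches a quarter as many threads, each issuing one wide load and one wide store instead of four narrow ones.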

njuffa · July 2, 2018, 1:08am

There used to be a sizeable performance advantage from choosing wider accesses with older GPU architectures. This was at least in part due to the fact that the load/store queues could track only a fixed number of memory accesses. So with each access being wider, the total amount of memory traffic (in bytes) that could be queued up would increase. The widest native access instructions on the GPU are 128-bit loads / stores, motivating the use of int4 / float4 / double2 to maximize memory throughput.
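As an illustration of what a 128-bit access requires in practice, the sketch below checks the size and alignment of the built-in vector types and scales an array in place using float4 accesses for the bulk plus scalar accesses for the leftover elements. The names `is_aligned16`, `scale`, and `scale_on_device` are hypothetical, and the fixed 64x256 launch configuration is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// The 128-bit vector types are 16 bytes wide and 16-byte aligned.
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 is 128 bits");
static_assert(sizeof(int4) == 16 && alignof(int4) == 16, "int4 is 128 bits");
static_assert(sizeof(double2) == 16 && alignof(double2) == 16, "double2 is 128 bits");

// A 128-bit access needs a 16-byte-aligned address.
inline bool is_aligned16(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) & 15u) == 0;
}

// Scales data[0..n) by s. Assumes data is 16-byte aligned.
__global__ void scale(float* data, size_t n, float s)
{
    const size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    const size_t stride = gridDim.x * (size_t)blockDim.x;
    const size_t n4 = n / 4;
    float4* data4 = reinterpret_cast<float4*>(data);

    // Bulk: one 128-bit load and store per iteration.
    for (size_t i = tid; i < n4; i += stride) {
        float4 v = data4[i];
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        data4[i] = v;
    }
    // Tail: at most 3 elements, handled with 32-bit accesses.
    for (size_t i = 4 * n4 + tid; i < n; i += stride) data[i] *= s;
}

// Hypothetical host wrapper: copies v to the device, scales it, copies it back.
bool scale_on_device(std::vector<float>& v, float s)
{
    if (v.empty()) return true;
    float* d = nullptr;
    const size_t bytes = v.size() * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    bool ok = is_aligned16(d)
           && cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scale<<<64, 256>>>(d, v.size(), s);
        ok = cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

Pointers returned by `cudaMalloc` satisfy the alignment requirement; a pointer offset into such a buffer (say `d + 1`) would not, which is why the check is worth having before casting to `float4*`.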

I am reasonably certain that starting with the Maxwell generation of GPUs the performance benefit of wider accesses disappeared, due to significant changes in how the memory hierarchy was implemented in hardware. There may be a residual effect from wider accesses leading to a reduction in static and dynamic instruction count; however, this is probably counterbalanced by wider accesses tending to increase register pressure.

Robert_Crovella · July 2, 2018, 1:14am

^what he said

My $0.02:

I think you have a pretty good grasp of it.

When all threads are just banging on the memory subsystem, as you’ve already indicated, it takes minimal programming effort to saturate things. But there may be other situations where you only have certain threads reading at a particular time, or possibly some other scenarios, where the efficiency benefit is useful.

To expand on this a bit, one of the figures of merit for a CUDA program is how much parallel work is exposed (by the programmer, and subsequently by the compiler). This has to do with the instantaneous thread-carrying capacity of a particular GPU, for which we can use 2048 * number of SMs as a proxy. To maximize the amount of memory traffic and general parallel work per thread, it may be useful to do vector loads/stores. Something similar can often be achieved with careful arrangement of loops to permit unrolling/reordering by the compiler, but that is sometimes easier said than done.
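A rough sketch of both ideas follows. The first function queries the thread-carrying capacity instead of hard-coding 2048, since the per-SM limit differs between architectures. The second part is one possible way to arrange an unrolled loop so that each thread issues several independent, still-coalesced scalar loads before any store. All names here (`resident_thread_capacity`, `copy_unrolled`, `unrolled_copy_ok`) are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Max resident threads per SM times the number of SMs, or -1 on error.
long long resident_thread_capacity(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return (long long)prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
}

// Each thread handles ITEMS elements spaced blockDim.x apart, so at every
// step consecutive threads still touch consecutive words (coalesced).
template <int ITEMS>
__global__ void copy_unrolled(const float* __restrict__ in, float* __restrict__ out, size_t n)
{
    const size_t base = blockIdx.x * (size_t)blockDim.x * ITEMS + threadIdx.x;
    float tmp[ITEMS];
    // All loads are issued first, so they are independent of each other.
    #pragma unroll
    for (int k = 0; k < ITEMS; ++k) {
        const size_t i = base + k * (size_t)blockDim.x;
        tmp[k] = (i < n) ? in[i] : 0.0f;
    }
    #pragma unroll
    for (int k = 0; k < ITEMS; ++k) {
        const size_t i = base + k * (size_t)blockDim.x;
        if (i < n) out[i] = tmp[k];
    }
}

// Hypothetical helper: copies n floats with 4 items per thread and verifies the result.
bool unrolled_copy_ok(size_t n)
{
    if (n == 0) return true;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 977);

    float *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const unsigned block = 256;
        const size_t per_block = (size_t)block * 4;
        const unsigned grid = (unsigned)((n + per_block - 1) / per_block);
        copy_unrolled<4><<<grid, block>>>(d_in, d_out, n);
        ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in); cudaFree(d_out);
    return ok && h_out == h_in;
}
```

Unlike the float4 approach, this layout places no alignment or divisibility constraints on the data, at the cost of issuing ITEMS load instructions instead of one.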

Another possible benefit (not directly in the scope of your question, perhaps) is that it allows a limited amount of AoS activity in a machine that otherwise strongly prefers SoA memory organization.

A “typical” recommendation for efficient CUDA programming is to reorganize AoS storage patterns into SoA storage patterns. This is such a common topic that I’ll skip repeating it here. You can easily google for writeups, or here is one:

caching - CUDA profiler reports inefficient global memory access - Stack Overflow

However, in the case where your structure can be organized to fit into 16 bytes or 8 bytes per thread, then rather than do a full reorg to SoA-type storage, you can use a vector load/store to achieve perfect coalescing/full efficiency.
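A minimal sketch of that idea, assuming a made-up 16-byte `Particle` struct: each thread fetches its whole struct with one 128-bit load, and adjacent threads read adjacent 16-byte chunks, so the AoS layout still coalesces. The struct, the kernel `shift_x`, and the host wrapper are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical AoS element that fits exactly into 16 bytes.
struct __align__(16) Particle {
    float x, y, z, mass;
};
static_assert(sizeof(Particle) == 16 && alignof(Particle) == 16,
              "Particle must match float4 in size and alignment");

// Adds dx to every particle's x. The struct is read and written as a
// float4, i.e. one 128-bit load and one 128-bit store per thread.
__global__ void shift_x(Particle* p, size_t n, float dx)
{
    const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4* p4 = reinterpret_cast<float4*>(p);
    float4 v = p4[i];   // v.x, v.y, v.z, v.w map to x, y, z, mass
    v.x += dx;
    p4[i] = v;
}

// Hypothetical host wrapper: round-trips ps through the device.
bool shift_x_on_device(std::vector<Particle>& ps, float dx)
{
    if (ps.empty()) return true;
    Particle* d = nullptr;
    const size_t bytes = ps.size() * sizeof(Particle);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d, ps.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const unsigned block = 256;
        const unsigned grid = (unsigned)((ps.size() + block - 1) / block);
        shift_x<<<grid, block>>>(d, ps.size(), dx);
        ok = cudaMemcpy(ps.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

The explicit cast to `float4*` is used here to make the wide access obvious; with the `__align__(16)` attribute in place, the compiler may well generate the same 128-bit access for a plain `Particle q = p[i];`, but that is worth confirming in the generated SASS rather than assuming.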

nvredsox · July 2, 2018, 4:23am

Thanks for the background! I never expected it to be a historical effect, but the explanation makes sense.

Is there any rough idea how much impact it has on the bandwidth for Kepler devices? I.e., is it like a factor of 2x in practical bandwidth? Or kind of a 10% code-polishing effect only attractive to ninja perfectionists?

njuffa · July 2, 2018, 5:35am

The dark recesses of my memory want to say that memory throughput as measured by STREAM-type benchmarks differed by up to ~ 15% when comparing access via 32-bit, 64-bit, and 128-bit loads.
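Anyone wanting to measure this on their own hardware could use something along these lines: a minimal STREAM-copy-style sketch, not the actual STREAM benchmark. The names `stream_copy` and `copy_gbps`, the block size of 256, and the repetition count are arbitrary assumptions, and the buffer size in bytes should be a multiple of 16 so all three element types divide it evenly.

```cuda
#include <cuda_runtime.h>

// Copy kernel templated on the access width: T = float (32-bit),
// float2 (64-bit), or float4 (128-bit).
template <typename T>
__global__ void stream_copy(const T* __restrict__ in, T* __restrict__ out, size_t n)
{
    const size_t stride = gridDim.x * (size_t)blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// Returns the measured copy throughput in GB/s (reads plus writes),
// or -1.0 on any CUDA error.
template <typename T>
double copy_gbps(size_t bytes, int reps = 10)
{
    const size_t n = bytes / sizeof(T);
    if (n == 0 || reps <= 0) return -1.0;

    T *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&out, bytes) != cudaSuccess) { cudaFree(in); return -1.0; }
    cudaMemset(in, 0, bytes);

    const unsigned block = 256;
    const unsigned grid = (unsigned)((n + block - 1) / block);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    stream_copy<T><<<grid, block>>>(in, out, n);   // warm-up, not timed
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r)
        stream_copy<T><<<grid, block>>>(in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    const bool ok = cudaGetLastError() == cudaSuccess && ms > 0.0f;

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(in);
    cudaFree(out);

    // Each repetition reads `bytes` and writes `bytes`; ms * 1e6 converts to GB/s.
    return ok ? (2.0 * (double)bytes * reps) / ((double)ms * 1e6) : -1.0;
}
```

Calling `copy_gbps<float>(bytes)`, `copy_gbps<float2>(bytes)`, and `copy_gbps<float4>(bytes)` with the same buffer size (large enough not to fit in cache) gives three numbers to compare; results will vary with the GPU, clocks, and launch configuration.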

Note that on those older architectures, accessing memory in chunks smaller than 32 bits could reduce memory throughput significantly. I am reasonably sure that on modern architectures that, too, is no longer the case.

nvredsox · July 2, 2018, 5:41pm

njuffa, thanks for that ballpark estimate. That means there’s no need to be concerned about losing a lot of performance by using simpler word reads, especially on the last 3 generations of GPUs.

njuffa · July 2, 2018, 5:49pm

I would concur, with the possible exception of ninja-level optimizations under consideration of txbob’s remarks in #3.

The positive take-home message is that advances in recent GPU architectures have improved the life of CUDA programmers quite significantly when it comes to performance optimizations. It is much easier to get performance out of the hardware, and the CUDA profiler has improved quite a bit as well.

Now, if the hardware architects could fix the limited number of bits in the performance counter registers, we might be in fat city :-)
