CUDA Pro Tip: Increase Performance with Vectorized Memory Access

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

Many CUDA kernels are bandwidth bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth bound kernels. This makes it very important to take steps to mitigate bandwidth bottlenecks in your code. In this post, I will show you how to use vector loads and stores in CUDA C/C++ to…

Thank you for the interesting material, but there seems to be a mistake at the end of the device_copy_vector2_kernel code.

Thanks Anton, I have fixed the formatting error.

The section above that (process remaining elements) dangles in mid air...

"Dereferencing those pointers will cause the compiler to generate the vectorized instructions."

I wonder where I can look for more on the conditions under which the compiler will take the initiative in this way, both for memory access and for local instructions? It seems a bit magical to a newbie.

Another way of putting it: I want to understand *why* casting to a pointer to type int2 triggers automatic vectorization. (Happy to do my own reading, if someone would be kind enough to give helpful indirection.)

This is fixed now too. Thanks!

The compiler cannot guarantee that the pointer is int2 or int4 aligned, and thus cannot generate vectorized loads in the absence of additional information. By casting to a vector type, the user is giving the compiler more information by explicitly saying it is aligned. Of course, if the user is wrong about the alignment, the code is likely to produce a memory access fault. The maximum vectorized load/store is 16 bytes. Depending on the size of the struct, the compiler can do some interesting things; for example, on a 24-byte struct it could load 8, 8, 8 bytes but not 16, 8 (since the 16-byte load would only be aligned for every second struct element).
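To illustrate the 24-byte case with a sketch (the `Elem` type and `copy_elems` kernel below are hypothetical, not from the post):

```cuda
// sizeof(Elem) == 24. Consecutive array elements start at byte offsets
// 0, 24, 48, ... Every element is 8-byte aligned, so the compiler may copy
// one with three 8-byte loads, but it cannot use a 16-byte load: offset 24
// is not a multiple of 16, so a 16-byte access would be misaligned for
// every second element.
struct alignas(8) Elem {
  int a, b, c, d, e, f;  // 6 x 4 bytes = 24 bytes
};

__global__ void copy_elems(const Elem* d_in, Elem* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = idx; i < N; i += blockDim.x * gridDim.x)
    d_out[i] = d_in[i];
}
```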

Dr Harris,

Thank you for your helpful simplified explanation. [...]

[I've deleted the rest of this post because my remaining questions (1) would be better asked in a different forum and (2) have shown up a blockage in my understanding that I need to address. The penny will eventually drop but in the meantime I don't want to bother others.]

[The penny has dropped. Thanks.]

Quick question: how can I load four integers into registers (an array of four) that can be addressed by an index rather than by .x, .y, .z, .w?
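One possible approach, shown only as a sketch (the `IntVec4` type and `sum4_kernel` below are hypothetical): wrap the four ints in a 16-byte-aligned struct holding an array, so the compiler can still use a single 128-bit load while the elements stay index-addressable. If the index is not known at compile time (e.g. the loop is not unrolled), the array may be spilled to local memory instead of staying in registers.

```cuda
// 16 bytes, 16-byte aligned, elements addressable by index.
struct alignas(16) IntVec4 {
  int v[4];
};

__global__ void sum4_kernel(const int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N / 4) {
    // One 16-byte load, assuming d_in is 16-byte aligned (e.g. straight from cudaMalloc).
    IntVec4 a = reinterpret_cast<const IntVec4*>(d_in)[idx];
    int s = 0;
#pragma unroll
    for (int k = 0; k < 4; ++k)
      s += a.v[k];  // unrolled, so the indices are compile-time constants
    d_out[idx] = s;
  }
}
```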

>if(i < N)

this statement in the second vectorized variant should be replaced with for(; i < N; i++), since you may need to copy up to 3 remaining elements
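For reference, here is one way the tail could be handled so that up to three leftover elements are copied (a sketch following the same grid-stride structure as the post's kernels; the leftover loop is done by a single thread):

```cuda
__global__ void copy_vec4(const int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = idx; i < N / 4; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<const int4*>(d_in)[i];
  }
  // Copy the 0-3 remaining scalar elements in a single thread.
  if (idx == 0) {
    for (int i = N / 4 * 4; i < N; ++i)
      d_out[i] = d_in[i];
  }
}
```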

Is there a compute capability requirement? I'm on a Tesla C2070 with compute capability 2.0, and I don't see any LD.64 when I look at the SASS for device_copy_vector2_kernel; I just see regular LDs.

The examples here use Kepler, which has a different SASS ISA than Fermi (CC 2.x). However, the approach of vectorizing loads and stores should be valuable on Fermi GPUs as well.

This is a pedantic question since I understand Luitjens' intention/meaning, but when he says "However, it is important to note that there will be half as many instructions executed because the loop only executes N/2 times" after showing the SASS for `device_copy_vector2`, does he mean that only half of the *total* number of threads to be launched, N/2, is now needed? The for loop strides by += blockDim.x * gridDim.x so that it can reach elements of the int array whose index is >= blockDim.x * gridDim.x. It is not += 1, so the loop is not executed N/2 times by any single thread.
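For concreteness, paraphrasing the two loops from the post (tail handling omitted): summed over the whole grid, the scalar loop body runs N times and the int2 loop body runs N/2 times, regardless of the stride; no single thread runs N/2 iterations unless the grid has only one thread.

```cuda
__global__ void copy_scalar(const int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each element 0..N-1 is handled by exactly one thread:
  // N loop-body executions in total across the grid.
  for (int i = idx; i < N; i += blockDim.x * gridDim.x)
    d_out[i] = d_in[i];
}

__global__ void copy_vector2(const int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each int2 element 0..N/2-1 is handled by exactly one thread:
  // N/2 loop-body executions (and N/2 LD.64/ST.64 pairs) in total.
  for (int i = idx; i < N / 2; i += blockDim.x * gridDim.x)
    reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<const int2*>(d_in)[i];
}
```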


If the for-loop that uses offsets were to be unrolled 2 times (for comparing with vec2) or 4 times (for comparing with vec4), would we still see a difference in performance? This is puzzling because I would have thought that although the offset version requires more instructions, the additional unrolled instructions would be pipelined while waiting for the first load (or store) to complete, and so would not present a performance penalty. (https://stackoverflow.com/q...

My guess: In the offset case, contiguous cache lines are "loaded/stored" to/from different warps and the cache lines a warp "loads/stores" are far apart in memory. In the vector case, a warp "loads/stores" contiguous cache lines. Stores involve read-write. These three attributes would tend to make the vector version perform better.
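For what it's worth, one way to write the unrolled variant being described (a sketch; `copy_scalar_unroll2` is hypothetical and not from the post) is to ask the compiler to unroll the grid-stride loop, so each trip issues two independent 32-bit loads that can overlap in the memory pipeline:

```cuda
__global__ void copy_scalar_unroll2(const int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Unrolling duplicates the body (indices i and i + stride), but each
  // copy is still a separate 32-bit load/store, unlike the single 64-bit
  // load/store pair in the int2 version.
#pragma unroll 2
  for (int i = idx; i < N; i += blockDim.x * gridDim.x)
    d_out[i] = d_in[i];
}
```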

The blog post utilizes “type punning” which is Undefined Behavior as per the C/C++ Standard.

Do you know if there is any section in the CUDA documentation that explicitly states this particular type of aliasing is allowed and well-defined?
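For what it's worth, one commonly suggested way to express the same wide access without formal type punning is to copy through memcpy into a vector type; the compiler can often lower a fixed-size memcpy to a single wide load/store when it can see the alignment, though whether it does so in a given case has to be verified in the SASS. A sketch (the kernel name is hypothetical):

```cuda
#include <cstring>

__global__ void copy_vec2_memcpy(const int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = idx; i < N / 2; i += blockDim.x * gridDim.x) {
    int2 v;
    // memcpy with a compile-time size avoids formal type punning; whether
    // it becomes LD.64/ST.64 depends on what alignment the compiler can prove.
    memcpy(&v, d_in + 2 * i, sizeof(int2));
    memcpy(d_out + 2 * i, &v, sizeof(int2));
  }
}
```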

Thanks for the great post, Justin. I have a question:

… and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth bound kernels.

Doesn’t a kernel get closer to being compute-bound when arithmetic intensity (the ratio of flops to bandwidth) increases? I understand the kernel moves to the right on a roofline diagram when AI increases.
Thanks

Hi, my comment on the flops to bandwidth ratio in the post is from the perspective of the hardware, not the application. As your application increases its flops to bandwidth ratio, your workload becomes more compute bound. As the hardware increases its flops to bandwidth ratio, your application becomes more bandwidth bound. I hope that clarifies the statement.
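To put illustrative (made-up) numbers on the hardware-perspective point: the roofline ridge point is peak FLOP/s divided by peak bytes/s, and when that ratio grows faster than a kernel's arithmetic intensity, the kernel slides toward the bandwidth-bound region.

```latex
% Illustrative numbers only, not from the post.
\text{ridge point} = \frac{\text{peak FLOP/s}}{\text{peak bytes/s}},\qquad
\text{ridge}_A = \frac{10\,\text{TFLOP/s}}{1.0\,\text{TB/s}} = 10\ \tfrac{\text{FLOP}}{\text{byte}},\qquad
\text{ridge}_B = \frac{20\,\text{TFLOP/s}}{1.2\,\text{TB/s}} \approx 16.7\ \tfrac{\text{FLOP}}{\text{byte}}
```

A kernel with a fixed arithmetic intensity of 12 FLOP/byte sits above the ridge (compute bound) on generation A but below it (bandwidth bound) on generation B, even though the kernel itself did not change.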
