Relationship between CUDA and GPU Memory Bus Width


Does GPU memory bus width have anything to do with the best data alignment in CUDA? The GTX 1080, GTX 1080 Ti, and Titan V have memory bus widths of 256-bit, 352-bit, and 3072-bit, respectively. Does this make any difference to data transfers in a CUDA program?

I heard that, to achieve the best performance, it is better to load data as float4 instead of float3, since float4 has 128-bit alignment. With bus widths of 64-bit, 128-bit, or 256-bit, it seems easy to see why 128-bit could be the best data alignment. But how does this work with bus widths of 352-bit and 3072-bit?

Maybe I am mixing up the concepts, but if GPU memory bus width matters in any way to a CUDA developer, please let me know.

Thank you in advance for any answers or pointers,


No, GPU memory bus width has no effect on alignment or the use of float4 or any other vector type.

In addition:

(1) The GPU requires all memory accesses to be naturally aligned, i.e., the alignment is equal to the access width.

(2) Wider accesses utilize hardware resources more efficiently. In particular, the load/store queue that tracks outstanding memory accesses is a hardware resource with a finite number of entries. A wider access means each entry covers more bytes, i.e., the total number of “bytes in flight” that can be tracked increases. Accessing a float3 operand (a built-in CUDA type, but one without an alignment attribute) requires three 32-bit accesses, whereas accessing a float4 (a built-in CUDA type aligned to 16 bytes) requires a single 128-bit access.

(3) The GPU hardware only supports accesses whose width is a power of two, up to 128 bits, so data types matching this hardware capability will be more efficient. CUDA provides a bunch of built-in short-vector integer and floating-point types for this reason.

Thank you for the answers, txbob and njuffa.