Relationship between CUDA and GPU Memory Bus Width

yashiz · December 18, 2017, 9:06am

Hi,

Does GPU memory bus width have anything to do with the best data alignment in CUDA? The GTX 1080, GTX 1080 Ti and Titan V have memory bus widths of 256, 352 and 3072 bits respectively. Does this make any difference to data transfer in a CUDA program?

I heard that, to achieve the best performance, it is better to load data as float4 than as float3, because float4 has 128-bit alignment. That seems easy to understand for bus widths of 64, 128 and 256 bits, where a 128-bit alignment fits naturally. But how does this work with bus widths of 352 and 3072 bits?

Maybe I am mixing up the concepts, but if GPU memory bus width matters to a CUDA developer, please let me know.

Thank you in advance for any answers or corrections,

Yashiz

Robert_Crovella · December 18, 2017, 2:42pm

No, GPU memory bus width has no effect on alignment or the use of float4 or any other vector type.

njuffa · December 18, 2017, 6:34pm

In addition:

(1) The GPU requires that all memory accesses are naturally aligned, i.e. the address must be a multiple of the access width.

(2) Wider accesses utilize hardware resources more efficiently. In particular, the load/store queue which tracks outstanding memory accesses is a hardware resource with a finite number of entries. Wider access means each entry covers more bytes, i.e. the total number of “bytes in flight” tracked is increased. Accessing a float3 operand (a built-in CUDA type, but one with only 4-byte alignment) requires three 32-bit accesses; accessing a float4 operand (a built-in CUDA type with 16-byte alignment) requires a single 128-bit access.

(3) The GPU hardware only supports accesses whose width is a power of two, from 8 bits up to 128 bits, so data types matching the hardware capability will be more efficient. CUDA provides a bunch of built-in short-vector integer and floating-point types for this reason.
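
To make the three points above concrete, here is a minimal sketch of a copy that moves data as float4. It assumes the vector type definitions shipped with the CUDA toolkit headers; the names is_float4_aligned, copy_float4 and copy_as_float4_demo are made up for illustration and are not part of any NVIDIA API. The static_asserts record the size and alignment of float3 and float4, and the host helper checks the natural-alignment condition before a float buffer is reinterpreted as float4.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Size and alignment of the built-in vector types from the CUDA headers.
static_assert(sizeof(float3) == 12 && alignof(float3) == 4,
              "float3: 12 bytes, aligned like a plain float");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16,
              "float4: 16 bytes, naturally aligned for a 128-bit access");

// Hypothetical helper: true if the address is a multiple of the float4 width.
inline bool is_float4_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(float4) == 0;
}

// Each thread moves one float4, which the compiler can emit as a single
// 128-bit load and a single 128-bit store.
__global__ void copy_float4(const float4* __restrict__ in,
                            float4* __restrict__ out, size_t n4) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Hypothetical driver: copies n floats (n must be a non-zero multiple of 4)
// through the float4 kernel; returns true if the data survives the round trip.
bool copy_as_float4_demo(size_t n) {
    if (n == 0 || n % 4 != 0) return false;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    // cudaMalloc returns pointers aligned to at least 256 bytes, so the
    // reinterpret_cast below is safe. An offset pointer such as d_in + 1
    // would fail this check and must not be accessed as float4.
    bool ok = is_float4_aligned(d_in) && is_float4_aligned(d_out);
    if (ok) {
        ok = cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                        cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        const size_t n4 = n / 4;
        const unsigned threads = 256;
        const unsigned blocks = static_cast<unsigned>((n4 + threads - 1) / threads);
        copy_float4<<<blocks, threads>>>(reinterpret_cast<const float4*>(d_in),
                                         reinterpret_cast<float4*>(d_out), n4);
        // The blocking cudaMemcpy also waits for the kernel to finish.
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             h_in == h_out;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling copy_as_float4_demo(1 << 20) from main and compiling with nvcc should be enough to try it. Nothing in the sketch depends on the memory bus width of the device: the alignment requirement comes from the 16-byte access width alone.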

yashiz · December 19, 2017, 8:46am

Thank you for the answers, txbob and njuffa.
