When I load data from gmem to registers, what does the warp do? For example, 32 threads each load one uint3 element, and these elements are stored consecutively in gmem (int A[0..95]). Will the memory access be coalesced?
From the hardware side:
The access requirements for coalescing depend on the compute capability of the device and are documented in the CUDA C++ Programming Guide.
For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.
When consecutive threads access consecutive elements in memory, the access is coalesced and the number of 32-byte transactions is minimized.
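As a minimal illustration (mine, not from the programming guide): the classic fully coalesced pattern has consecutive threads reading consecutive 4-byte elements, so each warp's 128-byte request maps onto exactly four 32-byte transactions:

```cpp
// Minimal sketch (illustrative only): consecutive threads read consecutive
// 4-byte elements, so each warp's 128-byte request is served by exactly
// four 32-byte transactions.
__global__ void coalesced_copy(const int* __restrict__ in, int* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // thread i touches bytes [4*i, 4*i + 4)
}
```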
So will accessing the elements as uint3 be significantly better than using three separate int loads?
And what should the stride be: 3 uints (12 bytes) or 4 uints (16 bytes, i.e. every 4th uint stored as 0 padding)?
There is no native 12-byte (96-bit) memory access, only 32, 64 or 128 bits per thread.
So either you (or the compiler)
- read 8 bytes with one instruction (or 4+4 bytes with two instructions) plus 4 bytes with another instruction, leaving gaps in the access pattern but trusting L1 to absorb them; not fully coalesced
- store the 3 individual uint components in 3 separate arrays and use 3 instructions, as you mentioned as an alternative
- use a 16-byte (4-uint) stride in your memory layout, padding every 4th uint element with 0, at the cost of wasted bandwidth and storage
- use the following scheme, if the memory layout is fixed and you want good memory performance not only for L2, but also for L1 and the local LD/ST units:
Each group of 4 threads needs 48 consecutive bytes (4 threads * 3 * 4 bytes), which is 3 uint4.
You reinterpret your uint3 array as an array of 128-bit uint4 and read
- the first of the three uint4 with thread 0
- the second of the three with thread 1
- the second of the three with thread 2 as well (a duplicate load of the same 16 bytes, which costs no extra 32-byte sector)
- the third of the three with thread 3
- then you shuffle (with one instruction) the missing uint from thread 0 to thread 1 and from thread 3 to thread 2
- you reorder the uints locally (depending on the thread number) so that each thread ends up with the correct uint3

So you get nearly perfect memory accesses (one warp reads 24 aligned uint4 = 384 bytes, i.e. 12 * 32-byte sectors) + a single shuffle instruction + a few local select/arithmetic instructions, which are 'free' for most kernels.
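To make this concrete, here is a minimal sketch of such an inline device function (my reconstruction, untested); the name `load_uint3_coalesced` and its argument are made up, and it assumes a 1D block with a multiple of 32 threads, a 16-byte-aligned array, and that the caller passes the current warp's 96-uint chunk reinterpreted as uint4:

```cpp
// Sketch of the scheme above (my reconstruction, untested): thread i of the
// warp ends up with A[3*i], A[3*i+1], A[3*i+2] while issuing only aligned
// 128-bit loads plus a single shuffle.
// warpA4 points to the current warp's 96-uint chunk, reinterpreted as uint4
// (24 uint4 per warp); it must be 16-byte aligned.
__device__ uint3 load_uint3_coalesced(const uint4* __restrict__ warpA4)
{
    const unsigned lane = threadIdx.x & 31u;  // lane within the warp
    const unsigned sub  = lane & 3u;          // position within a group of 4 lanes
    const unsigned grp  = lane >> 2;          // each group of 4 lanes covers 3 uint4 (48 bytes)

    // lanes 0,1,2,3 of a group read uint4 number 0,1,1,2 of that group
    const unsigned idx = grp * 3u + (sub == 0u ? 0u : (sub == 3u ? 2u : 1u));
    const uint4 v = warpA4[idx];              // aligned 128-bit load

    // one shuffle: lane%4==1 fetches .w from the lane below (its element 3*i),
    //              lane%4==2 fetches .x from the lane above (its element 3*i+2)
    const unsigned send = (sub == 0u) ? v.w : v.x;
    const int src = (int)lane + (sub == 1u ? -1 : (sub == 2u ? 1 : 0));
    const unsigned got = __shfl_sync(0xffffffffu, send, src);

    // local reorder so every thread holds its own uint3
    if (sub == 0u) return make_uint3(v.x, v.y, v.z);
    if (sub == 1u) return make_uint3(got, v.x, v.y);
    if (sub == 2u) return make_uint3(v.z, v.w, got);
    return make_uint3(v.y, v.z, v.w);
}
```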
The data is stored in gmem this way, and it is read-only data:
int A[96]
For a single warp, thread i needs A[i * 3 + 0], A[i * 3 + 1], A[i * 3 + 2]. So maybe your final method would be the best?
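To be concrete, the straightforward version I have in mind (which I assume is your method 1) is just something like the following sketch; `naive_load` is only for illustration:

```cpp
// Straightforward per-thread access (method 1, I assume): each thread reads
// its own 12 bytes. Since uint3 only guarantees 4-byte alignment, the
// compiler will typically emit two or three separate loads per thread.
__global__ void naive_load(const int* __restrict__ A, uint3* __restrict__ out)
{
    const int i = threadIdx.x;  // single warp: i = 0..31
    const uint3* A3 = reinterpret_cast<const uint3*>(A);
    out[i] = A3[i];             // reads A[3*i + 0], A[3*i + 1], A[3*i + 2]
}
```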
- The first method (either written manually or generated automatically by the compiler when accessing uint3) would be 'dirty' (not fully coalesced), but fast enough, as it mostly slows down the LD/ST units and L1.
- Methods 2 and 3 would not work here (they require a different layout).
- Method 4 would work (you can put it into an inline device function to keep your code neat).
- Another method (5) would be to load into shared memory and read back from there, but it would be less efficient in terms of LD/ST units and shared memory use; see the sketch below.
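For completeness, a rough (untested) sketch of method 5 for the single-warp case; the kernel name `via_shared` and the exact staging split are just my illustration, and it assumes blockDim.x == 32 and a 16-byte-aligned A:

```cpp
// Rough sketch of method 5 (shared-memory staging) for one warp reading
// int A[96]; assumes blockDim.x == 32 and that A is 16-byte aligned
// (cudaMalloc guarantees this).
__global__ void via_shared(const int* __restrict__ A, uint3* __restrict__ out)
{
    __shared__ uint4 s4[24];                       // 24 uint4 = 96 uints
    const unsigned int* s = reinterpret_cast<const unsigned int*>(s4);
    const unsigned lane = threadIdx.x;

    // lanes 0..23 each stage one aligned 128-bit chunk into shared memory
    if (lane < 24)
        s4[lane] = reinterpret_cast<const uint4*>(A)[lane];
    __syncwarp();                                  // shared memory now holds A[0..95]

    // each thread reads back its three values (3 * 32-bit shared loads;
    // stride 3 is coprime with the 32 banks, so there are no bank conflicts)
    out[lane] = make_uint3(s[3 * lane + 0], s[3 * lane + 1], s[3 * lane + 2]);
}
```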
If you compare methods 1, 4 and 5 for the LD/ST units (I think shuffles occupy them as well):
Method 1: 3 instructions (3 * 32-bit loads), 3 32-bit values
Method 4: 2 instructions (1 * 128-bit load + 1 * 32-bit shuffle)
Method 5: 5 instructions (1 * 128-bit global load + 1 * 128-bit shared store + 3 * 32-bit shared loads)
I would guess either method 1 or method 4 would be fastest and use the least resources.
I think the number of instructions mainly adds to the pipeline length, while the actual number of 32-bit transactions occupies the LD/ST units for that many cycles.