When I load data from gmem to registers, what does the warp do? For example, 32 threads each load one uint3 element, and these elements are stored consecutively in gmem (int A[0..95]). Will the memory access be coalesced?
From the hardware side:
The access requirements for coalescing depend on the compute capability of the device and are documented in the CUDA C++ Programming Guide.
For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.
When consecutive threads access consecutive elements in memory, the access is coalesced and the number of 32-byte transactions is minimized.
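As a minimal illustration (mine, not from the programming guide): the classic fully coalesced pattern has consecutive threads reading consecutive 4-byte elements, so each warp's 128-byte request maps onto exactly four 32-byte transactions:

```cpp
// Minimal sketch (illustrative only): consecutive threads read consecutive
// 4-byte elements, so each warp's 128-byte request is served by exactly
// four 32-byte transactions.
__global__ void coalesced_copy(const int* __restrict__ in, int* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // thread i touches bytes [4*i, 4*i + 4)
}
```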
So will accessing the elements as uint3 be significantly better than using three separate int loads?
And what should the stride be: 3 uints (12 bytes) or 4 uints (16 bytes, i.e. every 4th uint stored as 0 padding)?
There is no native 12-byte (96-bit) memory access, only 32, 64 or 128 bits per thread.
So either you (or the compiler)
- read 8 bytes with one instruction (or 4+4 bytes with two instructions) plus 4 bytes with another instruction, leaving gaps in the access pattern but trusting L1 to absorb them; not fully coalesced
- store the 3 individual uint components in 3 separate arrays and use 3 instructions, as you mentioned as an alternative
- use a 16-byte (4-uint) stride in your memory layout, padding every 4th uint element with 0, at the cost of wasted bandwidth and storage
- use the following scheme, if the memory layout is fixed and you want good memory performance not only for L2, but also for L1 and the local LD/ST units:
Each group of 4 threads needs 48 consecutive bytes (4 threads * 3 * 4 bytes), which is 3 uint4.
You reinterpret your uint3 array as an array of 128-bit uint4 and read
- the first of the three uint4 with thread 0
- the second of the three with thread 1
- the second of the three with thread 2 as well (a duplicate load of the same 16 bytes, which costs no extra 32-byte sector)
- the third of the three with thread 3
- then you shuffle (with one instruction) the missing uint from thread 0 to thread 1 and from thread 3 to thread 2
- you reorder the uints locally (depending on the thread number) so that each thread ends up with the correct uint3

So you get nearly perfect memory accesses (one warp reads 24 aligned uint4 = 384 bytes, i.e. 12 * 32-byte sectors) + a single shuffle instruction + a few local select/arithmetic instructions, which are 'free' for most kernels.
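To make this concrete, here is a minimal sketch of such an inline device function (my reconstruction, untested); the name `load_uint3_coalesced` and its argument are made up, and it assumes a 1D block with a multiple of 32 threads, a 16-byte-aligned array, and that the caller passes the current warp's 96-uint chunk reinterpreted as uint4:

```cpp
// Sketch of the scheme above (my reconstruction, untested): thread i of the
// warp ends up with A[3*i], A[3*i+1], A[3*i+2] while issuing only aligned
// 128-bit loads plus a single shuffle.
// warpA4 points to the current warp's 96-uint chunk, reinterpreted as uint4
// (24 uint4 per warp); it must be 16-byte aligned.
__device__ uint3 load_uint3_coalesced(const uint4* __restrict__ warpA4)
{
    const unsigned lane = threadIdx.x & 31u;  // lane within the warp
    const unsigned sub  = lane & 3u;          // position within a group of 4 lanes
    const unsigned grp  = lane >> 2;          // each group of 4 lanes covers 3 uint4 (48 bytes)

    // lanes 0,1,2,3 of a group read uint4 number 0,1,1,2 of that group
    const unsigned idx = grp * 3u + (sub == 0u ? 0u : (sub == 3u ? 2u : 1u));
    const uint4 v = warpA4[idx];              // aligned 128-bit load

    // one shuffle: lane%4==1 fetches .w from the lane below (its element 3*i),
    //              lane%4==2 fetches .x from the lane above (its element 3*i+2)
    const unsigned send = (sub == 0u) ? v.w : v.x;
    const int src = (int)lane + (sub == 1u ? -1 : (sub == 2u ? 1 : 0));
    const unsigned got = __shfl_sync(0xffffffffu, send, src);

    // local reorder so every thread holds its own uint3
    if (sub == 0u) return make_uint3(v.x, v.y, v.z);
    if (sub == 1u) return make_uint3(got, v.x, v.y);
    if (sub == 2u) return make_uint3(v.z, v.w, got);
    return make_uint3(v.y, v.z, v.w);
}
```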
The data is stored in gmem this way, and it is read-only data:
int A[96]
For a single warp, thread i needs A[i * 3 + 0], A[i * 3 + 1], A[i * 3 + 2]. So maybe your final method would be the best?
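To be concrete, the straightforward version I have in mind (which I assume is your method 1) is just something like the following sketch; `naive_load` is only for illustration:

```cpp
// Straightforward per-thread access (method 1, I assume): each thread reads
// its own 12 bytes. Since uint3 only guarantees 4-byte alignment, the
// compiler will typically emit two or three separate loads per thread.
__global__ void naive_load(const int* __restrict__ A, uint3* __restrict__ out)
{
    const int i = threadIdx.x;  // single warp: i = 0..31
    const uint3* A3 = reinterpret_cast<const uint3*>(A);
    out[i] = A3[i];             // reads A[3*i + 0], A[3*i + 1], A[3*i + 2]
}
```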
- The first method (either written manually or generated automatically by the compiler when accessing uint3) would be 'dirty' (not fully coalesced), but fast enough, as it mostly slows down the LD/ST units and L1.
- Methods 2 and 3 would not work here (they require a different layout).
- Method 4 would work (you can put it into an inline device function to keep your code neat).
- Another method (5) would be to load into shared memory and read back from there, but it would be less efficient in terms of LD/ST units and shared memory use; see the sketch below.
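For completeness, a rough (untested) sketch of method 5 for the single-warp case; the kernel name `via_shared` and the exact staging split are just my illustration, and it assumes blockDim.x == 32 and a 16-byte-aligned A:

```cpp
// Rough sketch of method 5 (shared-memory staging) for one warp reading
// int A[96]; assumes blockDim.x == 32 and that A is 16-byte aligned
// (cudaMalloc guarantees this).
__global__ void via_shared(const int* __restrict__ A, uint3* __restrict__ out)
{
    __shared__ uint4 s4[24];                       // 24 uint4 = 96 uints
    const unsigned int* s = reinterpret_cast<const unsigned int*>(s4);
    const unsigned lane = threadIdx.x;

    // lanes 0..23 each stage one aligned 128-bit chunk into shared memory
    if (lane < 24)
        s4[lane] = reinterpret_cast<const uint4*>(A)[lane];
    __syncwarp();                                  // shared memory now holds A[0..95]

    // each thread reads back its three values (3 * 32-bit shared loads;
    // stride 3 is coprime with the 32 banks, so there are no bank conflicts)
    out[lane] = make_uint3(s[3 * lane + 0], s[3 * lane + 1], s[3 * lane + 2]);
}
```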
If you compare methods 1, 4 and 5 for the LD/ST units (I think shuffles occupy them as well):
Method 1: 3 instructions (3 * 32-bit loads), 3 32-bit values
Method 4: 2 instructions (1 * 128-bit load + 1 * 32-bit shuffle)
Method 5: 5 instructions (1 * 128-bit global load + 1 * 128-bit shared store + 3 * 32-bit shared loads)
I would guess either method 1 or method 4 would be fastest and use the least resources.
I think the number of instructions mainly adds to the pipeline length, while the actual number of 32-bit transactions occupies the LD/ST units for that many cycles.