When reading scattered data for a single warp in CUDA, how can we achieve coalesced memory access?

SparkHu · February 21, 2024, 8:33am

In the program I’m developing, each thread needs to read a structure of size 144 bytes from an array occupying several gigabytes of memory. This structure consists of 36 integers. Due to algorithmic constraints, it’s difficult to arrange the data each thread in the same warp needs to read into contiguous memory regions, or to group threads needing to read contiguous regions into the same warp. Therefore, I’m facing a common issue: how to achieve coalesced memory access when a warp needs to read scattered data from global memory?

GPU: RTX 3080
CUDA: 12.3

striker159 · February 21, 2024, 8:49am

You could read each individual object coalesced using a warp and store it to shared memory. This should be the most efficient way. Why is this not possible for you?
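A minimal sketch of that idea, assuming each thread already knows the index of the record it needs. The names `Record`, `gather`, `indices`, and `kWarpsPerBlock` are hypothetical and not taken from the original program. The warp visits its 32 records one at a time: the owning lane broadcasts its index with `__shfl_sync`, and all lanes then read consecutive ints of that record, so each 144-byte record is fetched with contiguous accesses.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kInts = 36;          // 36 ints = 144 bytes, as in the question
constexpr int kWarp = 32;
constexpr int kWarpsPerBlock = 4;  // hypothetical; launch with 4 * 32 threads per block

struct Record { int v[kInts]; };

__global__ void gather(const Record* records, const int* indices, int n, long long* sums)
{
    // One staging slot per thread; the extra int of padding makes the row stride 37 words,
    // so the per-lane reads at the end hit 32 different banks.
    __shared__ int staged[kWarpsPerBlock][kWarp][kInts + 1];

    const int lane = threadIdx.x % kWarp;
    const int warp = threadIdx.x / kWarp;
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int my_index = (tid < n) ? indices[tid] : -1;   // -1 marks "no record"

    for (int owner = 0; owner < kWarp; ++owner) {
        // Every lane learns which record lane `owner` wants.
        const int idx = __shfl_sync(0xffffffffu, my_index, owner);
        if (idx < 0) continue;                             // same decision in all lanes
        const int* src = records[idx].v;
        // Lanes read consecutive ints: 32 in the first pass, the remaining 4 in the second.
        for (int i = lane; i < kInts; i += kWarp)
            staged[warp][owner][i] = src[i];
    }
    __syncwarp();   // make the staged data visible to the owning lanes

    if (tid < n) {
        long long s = 0;   // stand-in for the real per-thread work
        for (int i = 0; i < kInts; ++i) s += staged[warp][lane][i];
        sums[tid] = s;
    }
}

int main()
{
    const int num_records = 4096, n = 1000;
    std::vector<Record> h_records(num_records);
    for (int r = 0; r < num_records; ++r)
        for (int i = 0; i < kInts; ++i) h_records[r].v[i] = r * 100 + i;
    std::vector<int> h_indices(n);
    for (int t = 0; t < n; ++t) h_indices[t] = (t * 7919) % num_records;  // scattered

    Record* d_records; int* d_indices; long long* d_sums;
    cudaMalloc(&d_records, num_records * sizeof(Record));
    cudaMalloc(&d_indices, n * sizeof(int));
    cudaMalloc(&d_sums, n * sizeof(long long));
    cudaMemcpy(d_records, h_records.data(), num_records * sizeof(Record), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, h_indices.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = kWarpsPerBlock * kWarp;
    gather<<<(n + threads - 1) / threads, threads>>>(d_records, d_indices, n, d_sums);

    std::vector<long long> h_sums(n);
    cudaMemcpy(h_sums.data(), d_sums, n * sizeof(long long), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int t = 0; t < n; ++t)   // sum of r*100 + i for i = 0..35 is 3600*r + 630
        if (h_sums[t] != 3600LL * h_indices[t] + 630) ++bad;
    printf("%s\n", bad == 0 ? "ok" : "mismatch");
    cudaFree(d_records); cudaFree(d_indices); cudaFree(d_sums);
    return bad != 0;
}
```

The staging buffer costs 148 bytes per thread here, so the block size may have to shrink if the kernel needs shared memory for other purposes.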

SparkHu · February 21, 2024, 10:39am

When attempting to utilize shared memory for achieving coalesced memory access, I encountered some issues. I wanted to leverage the new feature of asynchronous copy introduced in the Ampere architecture. However, the following code resulted in an illegal memory access error.

asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n\t"
                    ::"l"(warp_smem_base_ptr + smem_offset), "l"(ptr + u_index):"memory");

Yet, if I replace this portion of the code with the following snippet, everything works fine.

warp_smem_base_ptr[smem_offset] = *(ptr + u_index);

warp_smem_base_ptr is of type uint128_t *.
ptr is of type const uint128_t *.
Do you know why?

striker159 · February 21, 2024, 11:15am

compute-sanitizer should tell you the reason for the illegal memory access (out-of-bounds or misaligned access).

I would assume that the PTX instruction has the same limitations as cooperative_groups::memcpy_async with cuda::aligned_size_t: both the source and destination pointers must be aligned to 16 bytes to allow 16-byte transfers.
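For reference, a minimal sketch of the cooperative_groups route, assuming one warp stages one record. `Record`, `fetch`, `indices`, and `kWarpsPerBlock` are hypothetical names. The `alignas(16)` on the struct and the 144-byte size (a multiple of 16) are what make the `cuda::aligned_size_t<16>` promise valid for both pointers.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/barrier>   // provides cuda::aligned_size_t

namespace cg = cooperative_groups;

constexpr int kWarpsPerBlock = 4;   // hypothetical launch configuration

struct alignas(16) Record { int v[36]; };
static_assert(sizeof(Record) == 144, "record size should be a multiple of 16 bytes");

__global__ void fetch(const Record* records, const int* indices, int n, int* out)
{
    __shared__ Record staged[kWarpsPerBlock];   // 16-byte aligned through Record's alignas

    auto tile = cg::tiled_partition<32>(cg::this_thread_block());
    const int warp = threadIdx.x / 32;
    const int w = blockIdx.x * kWarpsPerBlock + warp;   // one record per warp

    if (w < n) {   // uniform within a warp
        // All 32 lanes cooperate on the copy; the size is given in bytes.
        cg::memcpy_async(tile, &staged[warp], &records[indices[w]],
                         cuda::aligned_size_t<16>(sizeof(Record)));
        cg::wait(tile);   // block until the tile's copy has landed
        if (tile.thread_rank() == 0)
            out[w] = staged[warp].v[0] + staged[warp].v[35];
    }
}

int main()
{
    const int num_records = 1024, n = 100;
    std::vector<Record> h_records(num_records);
    for (int r = 0; r < num_records; ++r)
        for (int i = 0; i < 36; ++i) h_records[r].v[i] = r * 100 + i;
    std::vector<int> h_indices(n);
    for (int t = 0; t < n; ++t) h_indices[t] = (t * 37) % num_records;

    Record* d_records; int* d_indices; int* d_out;
    cudaMalloc(&d_records, num_records * sizeof(Record));
    cudaMalloc(&d_indices, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_records, h_records.data(), num_records * sizeof(Record), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, h_indices.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    fetch<<<(n + kWarpsPerBlock - 1) / kWarpsPerBlock, kWarpsPerBlock * 32>>>(
        d_records, d_indices, n, d_out);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int t = 0; t < n; ++t)   // v[0] + v[35] = 200*r + 35
        if (h_out[t] != 200 * h_indices[t] + 35) ++bad;
    printf("%s\n", bad == 0 ? "ok" : "mismatch");
    cudaFree(d_records); cudaFree(d_indices); cudaFree(d_out);
    return bad != 0;
}
```

If either pointer were not 16-byte aligned, the `aligned_size_t<16>` argument would be a false promise; dropping it and passing a plain byte count lets the library choose a safe transfer width instead.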

SparkHu · February 22, 2024, 3:26am

compute-sanitizer reports this error:

Invalid shared write of size 4 bytes.
by thread (0,0,0) in block (0,0,0)
Address 0x5000000 is out of bounds.

This is odd, because warp_smem_base_ptr[smem_offset] = XX does not report a write error.

Before testing, I made a few minor modifications to the code:

asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n\t"
        ::"l"(warp_smem_base_ptr + smem_offset), "l"(ptr + u_index):"memory");//Error
//warp_smem_base_ptr[smem_offset] = *(ptr + u_index); //No error

striker159 · February 22, 2024, 5:31am

The PTX docs say:

6.4.1.1. Generic Addressing 

If a memory instruction does not specify a state space, the operation is performed using generic addressing.

And for cp.async:

Operand src specifies a location in the global state space and dst specifies a location in the shared state space.

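Because the instruction specifies the .shared state space, the destination operand is interpreted as a shared-memory address rather than a generic address. The plain C++ store uses generic addressing, which would explain why it works while the inline PTX faults. Try converting the generic pointer to a shared-memory address before passing it to cp.async.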
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#address-space-conversion-functions
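A minimal sketch of that conversion, assuming one warp copies one 144-byte record as nine 16-byte chunks. `cp_async_16`, `Record`, `fetch`, and `kWarpsPerBlock` are hypothetical names. `__cvta_generic_to_shared` and `__cvta_generic_to_global` are the conversion functions from the linked section. cp.async needs compute capability 8.0 or newer, so build with something like `nvcc -arch=sm_86` for an RTX 3080.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpsPerBlock = 4;   // hypothetical launch configuration
constexpr int kChunks = 9;          // 144 bytes / 16 bytes per cp.async

struct alignas(16) Record { int v[36]; };

// Copies 16 bytes from global to shared memory; both addresses must be 16-byte aligned.
__device__ __forceinline__ void cp_async_16(void* smem_dst, const void* gmem_src)
{
    // Convert the generic pointers to state-space addresses before handing them to PTX.
    const unsigned smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    const size_t   gmem_addr = __cvta_generic_to_global(gmem_src);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_addr), "l"(gmem_addr) : "memory");
}

__global__ void fetch(const Record* records, const int* indices, int n, int* out)
{
    __shared__ Record staged[kWarpsPerBlock];
    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;
    const int w = blockIdx.x * kWarpsPerBlock + warp;   // one record per warp

    if (w < n) {   // uniform within a warp
        const char* src = reinterpret_cast<const char*>(&records[indices[w]]);
        char* dst = reinterpret_cast<char*>(&staged[warp]);
        if (lane < kChunks) cp_async_16(dst + 16 * lane, src + 16 * lane);
        asm volatile("cp.async.commit_group;\n" ::: "memory");
        asm volatile("cp.async.wait_group 0;\n" ::: "memory");   // waits for this thread's copies
        __syncwarp();                                            // then publish them to the warp
        for (int i = lane; i < 36; i += 32) out[w * 36 + i] = staged[warp].v[i];
    }
}

int main()
{
    const int num_records = 1024, n = 100;
    std::vector<Record> h_records(num_records);
    for (int r = 0; r < num_records; ++r)
        for (int i = 0; i < 36; ++i) h_records[r].v[i] = r * 100 + i;
    std::vector<int> h_indices(n);
    for (int t = 0; t < n; ++t) h_indices[t] = (t * 37) % num_records;

    Record* d_records; int* d_indices; int* d_out;
    cudaMalloc(&d_records, num_records * sizeof(Record));
    cudaMalloc(&d_indices, n * sizeof(int));
    cudaMalloc(&d_out, n * 36 * sizeof(int));
    cudaMemcpy(d_records, h_records.data(), num_records * sizeof(Record), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, h_indices.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    fetch<<<(n + kWarpsPerBlock - 1) / kWarpsPerBlock, kWarpsPerBlock * 32>>>(
        d_records, d_indices, n, d_out);

    std::vector<int> h_out(n * 36);
    cudaMemcpy(h_out.data(), d_out, n * 36 * sizeof(int), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int t = 0; t < n; ++t)
        for (int i = 0; i < 36; ++i)
            if (h_out[t * 36 + i] != h_indices[t] * 100 + i) ++bad;
    printf("%s\n", bad == 0 ? "ok" : "mismatch");
    cudaFree(d_records); cudaFree(d_indices); cudaFree(d_out);
    return bad != 0;
}
```

The shared address is passed as a 32-bit register (`"r"`) here, which is a common choice because shared-memory addresses fit in 32 bits. Note also that the `.cg` variant only accepts a 16-byte copy size, while `.ca` accepts 4, 8, or 16.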

Curefab · March 1, 2024, 12:01am

For best performance, try to align each structure to at least a 32-byte boundary, e.g. by padding it from 144 to 160 bytes.
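One way to get that padding, sketched with hypothetical type names: `alignas(32)` makes the compiler round the struct size up to the next multiple of 32, so every element of an array starts on a 32-byte boundary as long as the array itself does.

```cuda
#include <cstdio>

struct Record { int v[36]; };                    // 144 bytes, as in the question
struct alignas(32) PaddedRecord { int v[36]; };  // same payload, padded by the compiler

static_assert(sizeof(Record) == 144, "unpadded record is 36 ints");
static_assert(sizeof(PaddedRecord) == 160, "size is rounded up to a multiple of 32");
static_assert(alignof(PaddedRecord) == 32, "each element starts on a 32-byte boundary");

int main()
{
    printf("Record: %zu bytes, PaddedRecord: %zu bytes\n",
           sizeof(Record), sizeof(PaddedRecord));
    return 0;
}
```

The trade-off is about 11% more memory for the array (160 instead of 144 bytes per record), which may matter for a data set of several gigabytes.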
