When reading scattered data for a single warp in CUDA, how can we achieve coalesced memory access?

In the program I’m developing, each thread needs to read a 144-byte structure, consisting of 36 integers, from an array that occupies several gigabytes of global memory. Due to algorithmic constraints, it is difficult either to arrange the data that the threads of a warp need into contiguous memory regions, or to group threads that need contiguous regions into the same warp. So I am facing a common problem: how do I achieve coalesced memory access when a warp has to read scattered data from global memory?

GPU: 3080
CUDA: 12.3

You could read each individual object in a coalesced fashion with the whole warp and store it in shared memory. That should be the most efficient way. Why is this not possible for you?
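Not from the original thread, just a minimal sketch of what I mean (the names Record layout, NUM_INTS, records, indices and WARPS_PER_BLOCK are made up): the whole warp fetches one scattered 36-int record, with consecutive lanes reading consecutive ints so the requests coalesce, and stages it in shared memory for all lanes to use.

```cpp
// Minimal sketch: one warp cooperatively loads one scattered 144-byte record
// (36 ints) into shared memory with coalesced reads.
constexpr int NUM_INTS        = 36;   // 144 bytes per record
constexpr int WARPS_PER_BLOCK = 4;

__global__ void gather_records(const int* __restrict__ records,        // flat array, 36 ints per record
                               const long long* __restrict__ indices,  // record index per warp
                               int* __restrict__ out)
{
    __shared__ int smem[WARPS_PER_BLOCK][NUM_INTS];

    const int lane    = threadIdx.x % 32;
    const int warp_id = threadIdx.x / 32;

    // The record this warp needs; it can sit anywhere in the multi-GB array.
    const long long rec = indices[blockIdx.x * WARPS_PER_BLOCK + warp_id];
    const int* src = records + rec * NUM_INTS;

    // Coalesced: lanes 0..31 read ints 0..31, then lanes 0..3 read ints 32..35.
    for (int i = lane; i < NUM_INTS; i += 32)
        smem[warp_id][i] = src[i];
    __syncwarp();

    // Every lane of the warp can now work on the record from shared memory.
    int acc = 0;
    for (int i = 0; i < NUM_INTS; ++i)
        acc += smem[warp_id][i];
    if (lane == 0)
        out[blockIdx.x * WARPS_PER_BLOCK + warp_id] = acc;
}
```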

I ran into some issues while trying to use shared memory to achieve coalesced access. I wanted to take advantage of the asynchronous-copy feature introduced with the Ampere architecture, but the following code results in an illegal memory access error.

asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n\t"
                    ::"l"(warp_smem_base_ptr + smem_offset), "l"(ptr + u_index):"memory");

Yet, if I replace this portion of the code with the following snippet, everything works fine.

warp_smem_base_ptr[smem_offset] = *(ptr + u_index);

warp_smem_base_ptr is of type uint128_t *.
ptr is of type const uint128_t *.
Do you know why?

compute-sanitizer should tell you the reason for the illegal memory access (out-of-bounds or misaligned access).

I would assume that the PTX instruction has the same limitations as cooperative_groups::memcpy_async with cuda::aligned_size_t: both the source and destination pointers must be 16-byte aligned to allow 16-byte transfers.
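For reference, a hedged sketch of the memcpy_async pattern I am comparing against (the kernel name and sizes are made up, and I am assuming cuda::aligned_size_t comes in via <cuda/barrier>); the alignment contract shows up explicitly as cuda::aligned_size_t<16>, which promises that both pointers and the byte count are multiples of 16.

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/barrier>  // assumed header for cuda::aligned_size_t

namespace cg = cooperative_groups;

__global__ void stage_tile(const uint4* __restrict__ src, int n_elems)
{
    // Dynamic shared memory is 16-byte aligned, matching uint4.
    extern __shared__ uint4 smem[];

    auto block = cg::this_thread_block();

    // 16-byte-wide asynchronous copies; passing cuda::aligned_size_t<16>
    // asserts that src, smem and the byte count are all multiples of 16.
    cg::memcpy_async(block, smem, src,
                     cuda::aligned_size_t<16>(n_elems * sizeof(uint4)));

    cg::wait(block);  // wait until the copies have landed in shared memory
    // ... use smem ...
}
```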

compute-sanitizer reports the following error:

Invalid shared write of size 4 bytes.
by thread (0,0,0) in block (0,0,0)
Address 0x5000000 is out of bounds.

This is odd because warp_smem_base_ptr[smem_offset] = XX did not report a write error.

In addition, before testing, I made a few minor modifications to the code as follows:

asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n\t"
        ::"l"(warp_smem_base_ptr + smem_offset), "l"(ptr + u_index):"memory");//Error
//warp_smem_base_ptr[smem_offset] = *(ptr + u_index); //No error

PTX docs say:

6.4.1.1. Generic Addressing

If a memory instruction does not specify a state space, the operation is performed using generic addressing.

And for cp-async:

Operand src specifies a location in the global state space and dst specifies a location in the shared state space.

Try converting the generic pointer to a shared-state-space address before passing it to cp.async.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#address-space-conversion-functions
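A sketch of what that could look like for the 16-byte variant above (untested against the original code, with uint4 standing in for uint128_t): __cvta_generic_to_shared() turns the generic shared-memory pointer into a shared-state-space address, which is what cp.async expects for its destination operand.

```cpp
__device__ void cp_async_16(uint4* smem_dst, const uint4* gmem_src)
{
    // Convert the generic pointer to a shared-state-space address; shared
    // addresses fit into 32 bits, so a "r" (u32) operand is sufficient.
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));

    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_addr), "l"(gmem_src) : "memory");

    // cp.async is asynchronous: commit the batch and wait for it to complete
    // before this thread reads the data back from shared memory.
    asm volatile("cp.async.commit_group;\n" ::: "memory");
    asm volatile("cp.async.wait_group 0;\n" ::: "memory");
    // A __syncwarp()/__syncthreads() is still needed before other threads
    // read what this thread copied.
}
```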


For best performance, try to align each record to at least a 32-byte boundary, e.g. pad the 144-byte structure to 160 bytes.
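For example (just an illustration, PaddedRecord is a made-up name), you can let the compiler do the padding:

```cpp
// 36 ints = 144 bytes of payload; alignas(32) makes the compiler pad the
// struct to 160 bytes, so each array element starts on a 32-byte boundary.
struct alignas(32) PaddedRecord {
    int v[36];
};
static_assert(sizeof(PaddedRecord) == 160, "144 bytes padded up to 160");
```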
