In the program I’m developing, each thread needs to read a structure of size 144 bytes from an array occupying several gigabytes of memory. This structure consists of 36 integers. Due to algorithmic constraints, it’s difficult to arrange the data each thread in the same warp needs to read into contiguous memory regions, or to group threads needing to read contiguous regions into the same warp. Therefore, I’m facing a common issue: how to achieve coalesced memory access when a warp needs to read scattered data from global memory?
You could read each individual object coalesced using a warp and store it to shared memory. This should be the most efficient way. Why is this not possible for you?
When attempting to utilize shared memory for achieving coalesced memory access, I encountered some issues. I wanted to leverage the new feature of asynchronous copy introduced in the Ampere architecture. However, the following code resulted in an illegal memory access error.
compute-sanitizer should give you the reason for the illegal memory access. (out of bounds or misaligned access).
I would assume that the ptx instruction has the same limitations as using cooperative_groups::memcpy_async with cuda::aligned_size_t. Both input and output pointer must be aligned to 16 bytes to allow transfers of 16 bytes.