I am working on a CUDA C application that processes data stored in chunks of 352 elements, each element a 16-bit signed integer (int16_t). The data resides in global memory, and I need to efficiently:
1. Load up to 20 chunks into faster memory (registers/shared memory).
2. Perform a circular shift on each chunk before further processing.
3. Store the processed chunks back into global memory or use them in further computation.
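To make this concrete, here is a simplified sketch of my current naive approach, one thread per chunk (`start[]`, holding each chunk's first-element index, is a placeholder name of mine):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

#define CHUNK_LEN  352  // elements per chunk
#define NUM_CHUNKS 20   // chunks processed per launch

// Naive version: one thread per chunk. Each thread copies its whole
// chunk, applying the circular shift during the read.
__global__ void shift_chunks_naive(const int16_t* __restrict__ in,
                                   int16_t* __restrict__ out,
                                   const int* __restrict__ start)  // per-chunk offset, 0..351
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= NUM_CHUNKS) return;

    const int16_t* src = in  + (size_t)c * CHUNK_LEN;
    int16_t*       dst = out + (size_t)c * CHUNK_LEN;
    int s = start[c];

    for (int i = 0; i < CHUNK_LEN; ++i)
        dst[i] = src[(i + s) % CHUNK_LEN];  // circular shift by s
}
```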
Challenges in My Current Approach:
Memory Access:
Each chunk’s logical first element sits at an arbitrary offset (0–351) within its 352-element block, so each chunk must be realigned via a circular shift.
I use 20 threads (potentially across different warps) to load 20 chunks into registers, which might not be optimal.
Register and Shared Memory Usage:
A full chunk is 352 × 2 bytes = 704 bytes, i.e. ~176 32-bit registers per thread even with two int16 values packed per register, so holding whole chunks in registers would likely spill to local memory.
Using shared memory for realignment might introduce bank conflicts if not handled properly.
Thread Utilization:
With one thread per chunk, each thread walks its own contiguous 352-element run, so accesses across a warp are strided and non-coalesced, and 20 active threads leave 12 lanes of the warp idle.
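For comparison, this is what I imagine a warp-cooperative alternative would look like (one warp per chunk; untested sketch, and it assumes the block size is a multiple of 32):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

#define CHUNK_LEN  352
#define NUM_CHUNKS 20

// Warp-cooperative variant: one warp per chunk. The 32 lanes stride
// through the chunk together, so writes are fully coalesced and reads
// are contiguous runs offset by the shift (wrapping at most once).
__global__ void shift_chunks_warp(const int16_t* __restrict__ in,
                                  int16_t* __restrict__ out,
                                  const int* __restrict__ start)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp >= NUM_CHUNKS) return;

    const int16_t* src = in  + (size_t)warp * CHUNK_LEN;
    int16_t*       dst = out + (size_t)warp * CHUNK_LEN;
    int s = start[warp];

    for (int i = lane; i < CHUNK_LEN; i += 32) {
        int j = i + s;
        if (j >= CHUNK_LEN) j -= CHUNK_LEN;  // cheaper than %
        dst[i] = src[j];  // rotated read, coalesced write
    }
}
```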
Questions:
1. What is the best way to load 20 chunks into fast memory (registers/shared memory)?
2. Would a warp-cooperative approach (one warp per chunk) improve performance compared to using 20 separate threads?
3. How can I efficiently perform the circular shift in shared memory while minimizing bank conflicts?
4. Are there alternative memory layouts that could make this access pattern more efficient?
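Regarding question 3, here is roughly how I picture a shared-memory staging version with a padded layout to stagger rows across banks (again an untested sketch; `CHUNKS_PER_BLOCK`, `PAD`, and `start[]` are placeholder names, and it assumes the grid exactly covers the 20 chunks):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

#define CHUNK_LEN        352
#define CHUNKS_PER_BLOCK 4
#define PAD              2  // 2 * int16 = one 32-bit bank word

// CHUNK_LEN (352 int16) spans 176 four-byte bank words; padding each
// row by one extra bank word staggers the rows across the 32 banks,
// which helps if later stages read the same index across chunks.
__global__ void shift_chunks_smem(const int16_t* __restrict__ in,
                                  int16_t* __restrict__ out,
                                  const int* __restrict__ start)
{
    __shared__ int16_t tile[CHUNKS_PER_BLOCK][CHUNK_LEN + PAD];

    int chunk0 = blockIdx.x * CHUNKS_PER_BLOCK;

    // Stage: all threads cooperatively copy the block's chunks in,
    // with fully coalesced, unshifted global reads.
    for (int idx = threadIdx.x;
         idx < CHUNKS_PER_BLOCK * CHUNK_LEN;
         idx += blockDim.x) {
        int c = idx / CHUNK_LEN, i = idx % CHUNK_LEN;
        tile[c][i] = in[(size_t)(chunk0 + c) * CHUNK_LEN + i];
    }
    __syncthreads();

    // Drain: apply the circular shift while writing back, so the
    // misaligned reads hit shared memory, where they are cheap.
    for (int idx = threadIdx.x;
         idx < CHUNKS_PER_BLOCK * CHUNK_LEN;
         idx += blockDim.x) {
        int c = idx / CHUNK_LEN, i = idx % CHUNK_LEN;
        int j = i + start[chunk0 + c];
        if (j >= CHUNK_LEN) j -= CHUNK_LEN;
        out[(size_t)(chunk0 + c) * CHUNK_LEN + i] = tile[c][j];
    }
}
```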
I am running this on an RTX 4090 (Ada, compute capability 8.9).
Any insights or best practices for optimizing memory access and computation for this use case would be greatly appreciated!
Thank you