Optimizing Chunk-Based Memory Access and Circular Shift in CUDA C

I am working on a CUDA C application where I need to access and process data stored in chunks of 352 elements, with each element being a 16-bit signed integer. The data resides in global memory, and I need to efficiently:

Load up to 20 chunks into faster memory (registers/shared memory).
Perform a circular shift on each chunk before further processing.
Store the processed chunks back into global memory or use them in further computation.
Challenges in My Current Approach:
Memory Access:
Each chunk’s first element starts at a random index (0–351), meaning each chunk needs to be realigned via a circular shift.
I use 20 threads (potentially across different warps) to load 20 chunks into registers, which might not be optimal (see the simplified sketch after this list).

Register and Shared Memory Usage:
Storing full chunks in registers could lead to register spilling.
Using shared memory for realignment might introduce bank conflicts if not handled properly.
Thread Utilization:
Using 20 threads for 20 chunks might lead to warp divergence and non-coalesced memory access, which could degrade performance.
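
For reference, roughly what my current kernel looks like (heavily simplified sketch; names such as `d_in`, `chunk_start`, and `shift` are placeholders):

```
#include <cuda_runtime.h>
#include <cstdint>

constexpr int CHUNK_LEN = 352;   // 16-bit elements per chunk

// One thread per chunk: load, circular-shift, store (simplified).
// chunk_start[c] is the element index where chunk c begins,
// shift[c] is the rotation (0..351) to apply to chunk c.
__global__ void process_chunks_naive(const int16_t* d_in, int16_t* d_out,
                                     const int* chunk_start, const int* shift,
                                     int num_chunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= num_chunks) return;

    int16_t buf[CHUNK_LEN];                 // 704 bytes per thread -> spills to local memory
    const int16_t* src = d_in + chunk_start[c];

    for (int i = 0; i < CHUNK_LEN; ++i)     // per-thread sequential loads, not coalesced
        buf[i] = src[i];

    for (int i = 0; i < CHUNK_LEN; ++i)     // rotate left by shift[c]
        d_out[c * CHUNK_LEN + i] = buf[(i + shift[c]) % CHUNK_LEN];
}
```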

Questions:
1. What is the best way to load 20 chunks into fast memory (registers/shared memory) efficiently?
2. Would a warp-cooperative approach (one warp per chunk) improve performance compared to using 20 separate threads?
3. How can I efficiently perform the circular shift in shared memory while minimizing bank conflicts?
4. Are there alternative memory layouts that could make this access pattern more efficient?
I am running this on an RTX 4090.
Any insights or best practices for optimizing memory access and computation for this use case would be greatly appreciated!

Thank you

Hi,
that is doable.
By "circular shift", do you mean that some elements from the front go to the back?

How is it related to the alignment? By how many elements do you want to shift or rotate, and in which cases?

Normally you should use one warp (32 threads) for each chunk.

The memory system transfers data in units (sectors) of 32 bytes.

One chunk is 352 * 2 bytes = 704 bytes = 22 * 32 bytes.

But a chunk could be unaligned (not start at a 32-byte boundary).

Let’s budget bandwidth for 24 * 32 bytes per chunk.

We round the start index down to a multiple of 32 bytes (an element index divisible by 16, since each element is 2 bytes).

Then each of the 32 threads loads 24 bytes with 6, 3, or 2 load instructions (6 * 4 bytes, 3 * 8 bytes, or 16 + 8 bytes), i.e. into short2, short4, or uint4 registers (as there is no short8).

Each load instruction reads a contiguous, coalesced range across the warp, so a thread that issues several load instructions ends up holding non-contiguous pieces of the chunk: the elements in between belong to other threads. Which load width to choose therefore depends on how you want to shift and further process the data.
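
For example, a minimal sketch of the 3 * 8 bytes (short4) variant with one warp per chunk; the chunk-start table and kernel name are assumptions for illustration, and `d_in` is assumed to be padded so the 384-element window never reads past the end of the buffer:

```
#include <cuda_runtime.h>
#include <cstdint>

// One warp loads one chunk of 352 int16 elements.
// The start index is rounded down to a 32-byte boundary (a multiple of 16
// elements), so the warp reads 24 * 32 bytes = 384 elements, i.e. 12 elements
// (24 bytes) per thread as 3 coalesced short4 (8-byte) loads.
__global__ void load_chunks_warp(const int16_t* d_in, const int* chunk_start,
                                 int num_chunks)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x & 31;
    if (warp_id >= num_chunks) return;

    int start  = chunk_start[warp_id];
    int base   = start & ~15;     // round down to a multiple of 16 elements (32 bytes)
    int offset = start - base;    // 0..15: where the chunk really begins inside the window

    const short4* p = reinterpret_cast<const short4*>(d_in + base);

    short4 v[3];                  // 3 * 4 = 12 elements = 24 bytes per thread
    #pragma unroll
    for (int j = 0; j < 3; ++j)
        v[j] = p[j * 32 + lane];  // load j covers window elements j*128 .. j*128+127

    // Thread 'lane' now holds window elements j*128 + lane*4 .. +3 for j = 0..2.
    // 'offset' drives the shuffle/permute realignment and the circular shift
    // that follow; the actual processing is omitted in this sketch.
    (void)v; (void)offset;
}
```

The 6 * 4 (short2) and 16 + 8 (uint4 plus short4) variants only change the vector types and which elements each thread ends up holding.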

Threads within a warp can exchange data with shuffles (__shfl_sync), a switch case depending on the alignment offset can select the right variant, and inside each thread __byte_perm can recombine the shorts within a 4-byte word.
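
As an illustration of those building blocks, two small device helpers (a sketch, independent of the concrete shift amount):

```
// Rotate whole 32-bit words (two int16 elements each) across the warp:
// each lane receives the word held by the lane 'n' positions ahead of it.
__device__ unsigned int rotate_words_across_lanes(unsigned int w, int n)
{
    int lane = threadIdx.x & 31;
    return __shfl_sync(0xffffffffu, w, (lane + n) & 31);
}

// Recombine two neighbouring 32-bit words so the result starts one int16
// element (2 bytes) later: { high short of lo_word, low short of hi_word }.
__device__ unsigned int shift_by_one_element(unsigned int lo_word, unsigned int hi_word)
{
    return __byte_perm(lo_word, hi_word, 0x5432);   // bytes 2,3 of lo_word, then bytes 0,1 of hi_word
}
```

A switch on the alignment case then decides which lane rotation to apply and whether the odd-element recombination is needed at all.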

Shared memory can be used, but is probably not even necessary.
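
If you do want to go through shared memory anyway, the realignment itself is just modular indexing; a sketch with one warp per chunk (`rot` being the number of front elements that move to the back, and the names being placeholders):

```
#include <cstdint>

constexpr int CHUNK_LEN = 352;   // 16-bit elements per chunk

// One warp scatters one chunk into shared memory with the circular shift
// applied; after __syncwarp() the rotated chunk can be read back linearly.
// 'smem' must provide CHUNK_LEN shorts for this warp, 'lane' is threadIdx.x & 31.
__device__ void realign_in_shared(const int16_t* chunk_src, int16_t* smem,
                                  int rot, int lane)
{
    for (int i = lane; i < CHUNK_LEN; i += 32)
        smem[(i + CHUNK_LEN - rot) % CHUNK_LEN] = chunk_src[i];   // element i moves rot places toward the front
    __syncwarp();
    // Consecutive lanes write consecutive 2-byte slots, so two lanes share each
    // 4-byte bank word, which does not count as a bank conflict; only the one
    // loop iteration that crosses the wrap-around point can see a mild conflict.
}
```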

If you directly process the loaded values with extensive compute, then the roughly 8% overhead of handling 24 * 32 bytes instead of 22 * 32 bytes could be undesirable; in that case you would process (parts of) more than one chunk together per warp so that those 8% are put to use as well.

But your algorithm is probably (like most) memory-bound.
Then the theoretical optimization would be about 4%: 23 * 32 bytes (the minimum an unaligned 704-byte chunk can touch) vs. 24 * 32 bytes. But that would make the data shuffling and chunk distribution more complicated, and it would only help with L1 speed.
