GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA

Originally published at:

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I’ll cover several common scenarios ranging from fast static indexing to more complex and challenging use cases. Static indexing Before discussing dynamic indexing let’s…

Thanks much for this Maxim! I was thinking about it for a many-small array project I am thinking of. There are several experiments and opinions about occupancy/registers/shared memory/spills around and your post is definitely fresh stuff.

Two comments (Let me tell you I am still educating myself about CUDA).

1) In your first example of dynamical rays, you write: i * a[start_index + i], while in the assembler instructions, there is no multiplication by "i". If the "i*" is kept, does the uniform access still hold?

2) In your no-bank-conflict indexing, one creates a new array "A". In your example, I do not see how "A" is actually fed with its actual values. I imagine there is an allocation time.

I wonder whether you could post your actual codes to better understand what and how (I found this very useful from other Mark Harris posts). I am really interested to see if this can help me.

Hi Sergi,

I am glad you found this post useful!

1) You are perfectly right. I changed the source code while looking for neat SASS and forgot to change it here. Fixed now.

2) Yes, I omitted setting A from the listing for the clarity.

Here is the full source code of all the experiments I ran:

Hello Maxim! Thanks for this useful post. I have a comment. The need for the helper function named `no_bank_conflict_index` can be avoided by allocating `A` 2 dimensional. Knowing that linearizing a multi-dimensional representation of the memory starts from the right most dimension, using below declaration of `A`

__shared__ float A[ ARRAY_SIZE ][ THREADBLOCK_SIZE ];

`val` can be retrieved with:

float val = A[ index ][ threadIdx.x ];

Although both versions will probably results in the same machine code and have the same performance, this representation might be better for clarity.

Hello and Thanks to your post that is so useful.

I have one question about below sentence.

"Approximately 2.5 replays on average when index is an independent random variable with uniform distribution from 0 to 31;"

What is the reason that replay number is 2.5...

Hi Farzad, nice trick! This should definitely work.

Hi! I did a Monte Carlo experiment with each thread accessing random location with uniform distribution, and I got ~3.5 shared memory banks accessed. Thus we have 1 "normal" access and 2.5 replays (in average).

Amazing trick, and still work for very well for Pascal (GP102)
Thanks for sharing!

what is meaning of "replay" in section "Dynamic Indexing with Non-Uniform Access" ?
there are 2 lines
A: int index = indexbuf[threadIdx.x + blockIdx.x * blockDim.x];
B: float val = a[index];
which line will be replayed, A or B?
from the context, I guess "replay load/store" means threads need to read from different location. if all threads in a warp read the same location, we do not need replay.
But I still have a question:
all threads in a warp have same "index", but they still need to read "val" from different location of local memory, because their "a" is different.