GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA

Originally published at: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I’ll cover several common scenarios ranging from fast static indexing to more complex and challenging use cases. Static indexing Before discussing dynamic indexing let’s…

Thanks very much for this, Maxim! I was thinking about it for a many-small-arrays project I have in mind. There are several experiments and opinions around about occupancy/registers/shared memory/spills, and your post is definitely fresh material.

Two comments (let me say that I am still educating myself about CUDA):

1) In your first example of dynamic arrays, you write i * a[start_index + i], while in the assembly instructions there is no multiplication by "i". If the "i *" is kept, does the uniform access still hold?

2) In your no-bank-conflict indexing, one creates a new array "A". In your example, I do not see how "A" is actually filled with its values. I imagine this happens at some earlier point.

I wonder whether you could post your actual code so readers can better see what is done and how (I found this very useful in other Mark Harris posts). I am really interested to see if this can help me.

Hi Sergi,

I am glad you found this post useful!

1) You are perfectly right. I changed the source code while looking for neat SASS and forgot to change it here. Fixed now.

2) Yes, I omitted setting A from the listing for clarity.

Here is the full source code of all the experiments I ran: http://pastebin.com/bZjNZkpZ

Hello Maxim! Thanks for this useful post. I have a comment. The need for the helper function `no_bank_conflict_index` can be avoided by declaring `A` as a two-dimensional array. Since linearizing a multi-dimensional array proceeds from the rightmost dimension, using the declaration of `A` below:

__shared__ float A[ARRAY_SIZE][THREADBLOCK_SIZE];

`val` can be retrieved with:

float val = A[index][threadIdx.x];

Although both versions will probably result in the same machine code and have the same performance, this representation might be clearer.

Hello, and thanks for your post; it is very useful.

I have a question about the sentence below.

"Approximately 2.5 replays on average when index is an independent random variable with uniform distribution from 0 to 31;"

What is the reason that the replay number is 2.5?

Hi Farzad, nice trick! This should definitely work.

Hi! I ran a Monte Carlo experiment with each thread accessing a random location with uniform distribution, and I got ~3.5 shared memory bank accesses per warp request. Thus we have 1 "normal" access and 2.5 replays (on average).

Amazing trick, and it still works very well on Pascal (GP102).
Thanks for sharing!

What is the meaning of "replay" in the section "Dynamic Indexing with Non-Uniform Access"?
There are 2 lines:
A: int index = indexbuf[threadIdx.x + blockIdx.x * blockDim.x];
B: float val = a[index];
Which line will be replayed, A or B?
From the context, I guess "replay load/store" means the threads need to read from different locations; if all threads in a warp read the same location, no replay is needed.
But I still have a question:
all threads in a warp have the same "index", yet they still need to read "val" from different locations in local memory, because each thread's "a" is different.