Per-Thread Repeated Access into Small Shared Float Array

I’m working on a project centered on generating points on an IFS (iterated function system) fractal.
Each point is generated by following a random trajectory for n steps, where each step of the trajectory moves the current point halfway toward a randomly chosen vertex.
I want to generate as many points as I can in parallel within a fixed time budget.
My current plan is to run only 1 warp per SM for the entire duration of the computation, since only 1 warp can be run at a time in a given SM.
Within each warp, I plan to store the vertices that can be chosen in shared memory so that they can be accessed quickly by all the threads. Since each float is 32 bits wide, each vertex value should land in its own memory bank, so there should be no bank conflicts.
What I’m less sure of is the best way to repeatedly generate random indices in each thread. I think I should have 1 curand state per thread, with all the states in a run sharing the same seed and a sequence number based on the block/thread id. However, I’m not sure how I should store the curand state. If I don’t put the state in shared memory but instead just put it on the device stack, will that slow me down when I use the state to generate a value every iteration in a thread? Should I declare the curand states in shared memory as well? I’ve read that local memory is quite slow, so I’m not sure what to do. Also, if I’m dynamically allocating the vertex floats in shared memory, can I still statically allocate the curand states in shared memory at the same time?
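For reference, this is roughly the structure I have in mind (a sketch, not working code; the kernel name, the float2 vertex layout, and the seed handling are my placeholders, and here the curand state just lives in a local variable):

```cuda
#include <curand_kernel.h>

__global__ void chaos_game(const float2 *verts_in, int n_verts,
                           unsigned long long seed, int steps,
                           float2 *out)
{
    // Dynamically allocated shared memory holding the vertex table.
    extern __shared__ float2 s_verts[];
    for (int i = threadIdx.x; i < n_verts; i += blockDim.x)
        s_verts[i] = verts_in[i];
    __syncthreads();

    // One curand state per thread: same seed, sequence = global thread id.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);

    float2 p = make_float2(0.0f, 0.0f);
    for (int i = 0; i < steps; ++i) {
        int v = curand(&state) % n_verts;   // random vertex index
                                            // (modulo bias ignored here)
        p.x = 0.5f * (p.x + s_verts[v].x);  // move halfway toward it
        p.y = 0.5f * (p.y + s_verts[v].y);
    }
    out[tid] = p;
}
```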
Thanks for the help.

That statement is incorrect; possibly you are confused about how GPUs work. Furthermore, one warp per SM will most likely yield dismal performance on a GPU.

I think you may be misunderstanding what I’m saying. I’m not saying that you can’t associate multiple warps with an SM; I’m saying that at any given moment, an SM can only run one 32-thread warp at a time.
See here: http://cuda-programming.blogspot.com/2013/01/what-is-warp-in-cuda.html

“The warp size is the number of threads running concurrently on an MP. In actuality, the threads are running both in parallel and pipelined. At the time this was written, each MP contains eight SPs and the fastest instruction takes four cycles. Therefore, each SP can have four instructions in its pipeline for a total of 8 × 4 = 32 instructions being executed concurrently.”

Also, my GPU clearly shows 100% utilization when I run across all my cores, so I can’t imagine that I’d get more performance by increasing the warps per thread block.

My GPU is a gtx 1080 ti.

EDIT: Turns out I was wrong! I was looking at documentation for NVIDIA compute capability 3.x, but the GTX 1080 Ti has compute capability 6.1. For 6.1, there are 28 SMs with 4 warp schedulers each, so using 28 thread blocks with 128 threads each seems to yield optimal performance.
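In launch-configuration terms, that’s something like this (a sketch; chaos_game, n_verts, and the kernel arguments are placeholders for my actual kernel):

```cuda
// GTX 1080 Ti (cc 6.1): 28 SMs, 4 warp schedulers per SM.
// One block of 4 warps (128 threads) per SM as a starting point.
int    n_blocks  = 28;
int    n_threads = 128;                      // 4 warps per block
size_t smem      = n_verts * sizeof(float2); // dynamic shared memory for the vertices
chaos_game<<<n_blocks, n_threads, smem>>>(d_verts, n_verts, seed, steps, d_out);
```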

Even cc3.0 GPUs have multiple warp schedulers per SM. If not, it would be absolutely illogical to put 192 SP cores in a Kepler SM.

Furthermore, all warp schedulers from cc3.0 forward to cc6.x are dual issue capable.

Furthermore, 100% utilization, if you are using nvidia-smi to discover that metric, tells you very little about how efficiently you are using a GPU. Assuming that nvidia-smi reporting 100% utilization means there is no more performance to be had is a completely broken idea.

https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation/40938696#40938696

Let’s be clear:

In any given clock cycle, a cc3.0 or later SM can issue multiple instructions, from a single warp or from multiple warps.

Furthermore, there is a good reason to associate more warps than can theoretically be scheduled. Sooner or later, one or more of those “theoretical maximum schedulable” warps is going to stall. In that situation, it’s nice for the SM to have other ready warps to switch to.

One warp per SM is not a good idea, performance-wise.
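If you want to see how much headroom a given launch configuration leaves, the runtime occupancy API can report it (a sketch; the empty chaos_game here is a stand-in for your real kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chaos_game(/* ... */) {}  // stand-in for the real kernel

int main() {
    int max_blocks = 0;
    // How many blocks of 128 threads can be resident on one SM at once,
    // given this kernel's register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, chaos_game,
        128 /* threads per block */,
        0   /* dynamic shared memory bytes */);
    printf("resident blocks per SM at 128 threads: %d\n", max_blocks);
    return 0;
}
```

More resident warps per SM than the schedulers can issue from in one cycle is exactly what gives the SM somewhere to go when warps stall.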

I tried using 8 warps per thread block (2 warps per warp scheduler) and the performance difference was pretty small: 128 threads took 397 seconds and 256 threads took 387 seconds, about a 2.5% improvement. It’s better, but not by much.

Does this still hold for cc7.0 and 7.5? I thought Volta and Turing switched to a single dispatch unit per scheduler?

The Volta whitepaper says the design includes one scheduler and two dispatch units per SM.

The Turing SM only has a single dispatch unit.

I modified my previous comment.

Doesn’t the Volta whitepaper say one dispatch unit per SM?

Yes, another correction. Sorry.