Three questions about register shuffle and shared memory

  1. I can’t seem to find any website/PDF listing the latency of a register shuffle operation, or how many cycles it should take - is there any info on this? (I assume 1-15 cycles, but I don’t know if there is an exact number).

  2. Also, if many warps in a threadblock (or SM) are constantly performing register shuffles, can they block on each other? Or do shuffles effectively have unlimited bandwidth?

  3. If warps in a threadblock are attempting to read from shared memory, do they block on each other’s shared memory reads, or are these executed in parallel? (I’m asking this because nvvp says my program is shared memory limited even though I don’t have any bank conflicts - maybe this is because it’s limiting occupancy to 50%, or perhaps it’s bandwidth).

I asked a similar question about GMEM - and Robert Crovella explained that it operates analogously to a pipeline architecture (so requests do not block on each other) - but I was wondering if the same is true for register shuffles/SMEM.

Thanks.

NVIDIA generally doesn’t publish latencies, so you’re going to have to look for microbenchmarking studies by 3rd parties, or random threads around the internet from people who have tried to measure it. This is true for almost any latency, not just for warp shuffle. A quick search did not turn anything up for me. Throughput, on the other hand, is documented.

The GPU is a latency-hiding, throughput architecture, so for well-designed code, throughputs are the predictor of performance.

I view this as a throughput question. Referring to my previous response, warp shuffle seems to proceed at the rate of 1 instruction/clock/sm. Therefore, if all warps presently want to issue a warp shuffle instruction, they are not all going to issue in the same clock cycle.
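
For concreteness, here is a minimal sketch (illustrative only, not code from this thread) of a warp-wide reduction built from shuffle instructions; each __shfl_down_sync call is one shuffle instruction issued for the whole warp, which is the unit the 1 instruction/clock/sm figure counts:

```cuda
// Illustrative warp reduction: five shuffle instructions per warp of 32 lanes.
__device__ float warp_reduce_sum(float v)
{
    // each step folds the upper half of the remaining lanes onto the lower half
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFFu, v, offset);
    return v;   // lane 0 ends up holding the sum across the warp
}
```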

For shared memory, AFAIK neither bandwidth nor latency is documented. So in this case too you can find microbenchmarking studies (e.g. chapter 3, page 19) that talk about it, and it probably varies by architecture. It’s certainly the case that a program can be shared memory limited, and relieving shared memory pressure can therefore increase performance. Here is an example.
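
As an illustration of what relieving shared memory pressure can look like (a generic sketch, not the linked example): an intra-warp broadcast that round-trips through shared memory can sometimes be replaced with a shuffle, which removes those shared loads and stores entirely.

```cuda
// Two ways to broadcast lane 0's value across a warp. The shared memory version
// costs a store, a sync, and 32 loads from SMEM; the shuffle version generates
// no shared memory traffic at all. (warp_slot is a hypothetical pointer to one
// SMEM float reserved for this warp.)
__device__ float broadcast_via_smem(float v, float *warp_slot)
{
    if ((threadIdx.x & 31) == 0)
        *warp_slot = v;      // lane 0 publishes its value
    __syncwarp();
    return *warp_slot;       // every lane reads it back
}

__device__ float broadcast_via_shuffle(float v)
{
    return __shfl_sync(0xFFFFFFFFu, v, 0);   // lane 0's value, no SMEM involved
}
```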


Thanks for the reply, I think I was confusing bandwidth with throughput…

I did some testing with a kernel where I was storing floats in registers; on each iteration i the threads would shuffle in the float held by lane (i % 32), and the warp would refill the registers with a GMEM read every 32 iterations.
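
Roughly this pattern (a sketch of how I’d write it, with made-up names and an assumed refill index, not the exact kernel):

```cuda
// Register-caching variant: each thread holds one float in a register and the
// warp broadcasts lane (i % 32)'s value with a shuffle on every iteration.
__global__ void reg_shuffle_kernel(const float * __restrict__ in,
                                   float * __restrict__ out,
                                   int n_iters)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    float reg = in[tid];                 // initial fill from GMEM
    float acc = 0.0f;

    for (int i = 0; i < n_iters; ++i) {
        // read the register held by lane (i % 32) via a warp shuffle broadcast
        acc += __shfl_sync(0xFFFFFFFFu, reg, i % 32);

        // refill the register cache from global memory every 32 iterations
        if ((i & 31) == 31)
            reg = in[tid + i + 1];       // assumed refill indexing; assumes in[] is large enough
    }
    out[tid] = acc;
}
```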

I rewrote it so that the data was cached in an SMEM row, refilled every 32 floats just as with the register caching. The average kernel execution time went from about 230ms to 220ms, but I doubt SMEM has a higher throughput than 1 instruction/clock/sm - so for some other reason, SMEM caching turned out to be the better choice in the case I was working with.
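
For reference, the SMEM version looked roughly like this (again a sketch with assumed names and refill indexing):

```cuda
// Shared memory variant: the warp stages 32 floats in a shared row and reads
// them back instead of shuffling registers.
__global__ void smem_cache_kernel(const float * __restrict__ in,
                                  float * __restrict__ out,
                                  int n_iters)
{
    extern __shared__ float row[];       // one float per thread in the block (dynamic SMEM)
    int tid       = blockIdx.x * blockDim.x + threadIdx.x;
    int lane      = threadIdx.x & 31;
    int warp_base = threadIdx.x - lane;  // first thread of this warp within the block

    row[threadIdx.x] = in[tid];          // initial fill from GMEM
    float acc = 0.0f;

    for (int i = 0; i < n_iters; ++i) {
        __syncwarp();
        // all 32 lanes read the same element: a broadcast, so no bank conflicts
        acc += row[warp_base + (i % 32)];

        // refill the warp's row from global memory every 32 floats
        if ((i & 31) == 31) {
            __syncwarp();
            row[threadIdx.x] = in[tid + i + 1];   // assumed refill indexing
        }
    }
    out[tid] = acc;
}
```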