I can’t seem to find any website or PDF that lists the latency of a register shuffle operation, i.e. how many cycles it should take - is there any info on this? (I assume somewhere between 1 and 15 cycles, but I don’t know if there is an exact number.)
Also, if many warps in a threadblock (or SM) are constantly performing register shuffles, can they block on each other? Or does it effectively have unlimited bandwidth?
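In case it helps frame the shuffle question, here is a minimal sketch of how I would try to measure the latency myself: a single warp runs a dependent chain of shuffles and times it with `clock64()`. The kernel name, `kIters`, and the `sink` buffer are just names I made up for illustration, and I'm assuming CUDA 9 or later for `__shfl_xor_sync`. The result includes loop and timer overhead, so I'd treat it as a rough upper bound rather than an exact figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical chain length; longer chains amortize the timer overhead.
constexpr int kIters = 4096;

// Times a dependent chain of warp shuffles on a single warp.
// Each shuffle consumes the previous result, so they cannot overlap.
__global__ void shuffleLatencyKernel(long long *cycles, int *sink)
{
    int v = threadIdx.x;

    long long start = clock64();
    #pragma unroll 32
    for (int i = 0; i < kIters; ++i) {
        v = __shfl_xor_sync(0xffffffffu, v, 1);
    }
    long long stop = clock64();

    // Store the result so the compiler cannot discard the shuffle chain.
    sink[threadIdx.x] = v;
    if (threadIdx.x == 0) {
        *cycles = stop - start;
    }
}

int main()
{
    long long *dCycles = nullptr;
    int *dSink = nullptr;
    cudaMalloc(&dCycles, sizeof(long long));
    cudaMalloc(&dSink, 32 * sizeof(int));

    // One block of one warp, so no other warp competes for the shuffle unit.
    shuffleLatencyKernel<<<1, 32>>>(dCycles, dSink);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long hCycles = 0;
    cudaMemcpy(&hCycles, dCycles, sizeof(hCycles), cudaMemcpyDeviceToHost);
    printf("approx. %.1f cycles per dependent shuffle\n",
           static_cast<double>(hCycles) / kIters);

    cudaFree(dCycles);
    cudaFree(dSink);
    return 0;
}
```

Launching the same kernel with more warps per block and comparing the per-warp cycle counts would presumably show whether concurrent shuffles contend with each other, but I don't know how to interpret the numbers without knowing what the hardware is supposed to do.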
If warps in a threadblock are attempting to read from shared memory, do they block on each other’s shared memory reads, or are these executed in parallel? (I’m asking because nvvp says my program is shared memory limited even though I don’t have any bank conflicts - maybe this is because my shared memory usage is limiting occupancy to 50%, or perhaps because I’m hitting a shared memory bandwidth limit.)
I asked a similar question about GMEM, and Robert Crovella explained that it operates analogously to a pipelined architecture (so requests do not block on each other), but I was wondering if the same is true for register shuffles and SMEM.