Is a load from global memory to shared memory decomposed into a load from global memory to a temporary register and a write from that register to shared memory? Or does data move from global memory directly to shared memory without ever “reaching” a processing core? My guess would be the first, seeing as you have direct control of shared memory, but I want to rule out any architectural trickery that could get around this. Thanks.
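Yes, an ordinary load from global to shared memory is always performed through registers. The one exception is the asynchronous copy instructions (cp.async in PTX) on GPUs of compute capability 8.0 and later, which can move data into shared memory without staging it in registers.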
You can check this yourself by looking at the output of cuobjdump -sass for an example program.
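For instance, here is a minimal sketch of such an example program. The file name, kernel name, and sizes are made up for illustration. The kernel copies one element per thread into shared memory with a plain assignment and then reads a different element back, so the shared store has to happen.

```cuda
// stage.cu (hypothetical file name)
// Build:   nvcc -o stage stage.cu
// Inspect: cuobjdump -sass stage
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // threads per block, also the tile size (assumed)

// Hypothetical kernel: stages one block of data through shared memory and
// writes it back reversed within the block.
__global__ void stage_through_shared(const float* in, float* out, int n)
{
    __shared__ float tile[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];  // the global -> shared copy in question
    __syncthreads();

    int j = blockDim.x - 1 - threadIdx.x;    // mirrored index inside the block
    int src = blockIdx.x * blockDim.x + j;   // global index that filled tile[j]
    if (i < n && src < n)
        out[i] = tile[j];                    // shared -> global
}

int main()
{
    const int n = 4 * kBlock;  // a multiple of the block size keeps this simple
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    stage_through_shared<<<n / kBlock, kBlock>>>(d_in, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Check that every block came back reversed.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        int block = i / kBlock, t = i % kBlock;
        if (h_out[i] != h_in[block * kBlock + (kBlock - 1 - t)]) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

In the disassembly of the kernel, the line marked as the global -> shared copy should show up as a global load into a register (an LDG-style instruction) followed by a shared store from that same register (an STS-style instruction). The exact mnemonics and register numbers depend on the target architecture and compiler version, so treat this as a guide to what to look for rather than the literal output.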
I thought as much. Profiling a sample kernel told me I was increasing the register count by using shared memory, and now that you mentioned SASS, I took a look at the SASS code in the Nsight profiler to confirm it. Thanks.