uboat, here you’d likely just use one block with one thread.
Though you could argue it’d be interesting to see how 30 blocks of one thread each (or however many SMs you have) would behave. It gets complicated. But read that linked paper if you want to get into the details, especially Figure 1, which shows that there is no such thing as a single number that defines “latency”.
You want the SPs idle when you are measuring latency. You want the timing to depend on the memory controller’s latency, without any extra processing delay, so the SP (even just one thread) should be doing nothing but waiting for that memory read to finish.
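For what it's worth, here is a rough sketch of the usual way to do that: a pointer chase run by one block with one thread. Each load's address comes from the previous load, so the thread can do nothing but wait on memory. The names (chase, average_load_cycles), the array size, the stride and the iteration count are all made up for illustration, not taken from the paper. It assumes a GPU that supports clock64(); older parts would need clock() instead. The number it prints still includes a little loop overhead, and it changes with footprint and stride (caches, TLBs), so treat it as one point on a curve rather than "the" latency.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread walks a chain of dependent loads. The next index is only known
// once the previous load has returned, so the loads cannot overlap.
__global__ void chase(const unsigned int *chain, int iters,
                      long long *cycles, unsigned int *sink)
{
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        idx = chain[idx];
    long long stop = clock64();
    *cycles = stop - start;
    *sink = idx;  // store the result so the compiler keeps the loop
}

// Hypothetical helper: average SM clock cycles per dependent load.
// n is the chain length in elements, stride is the hop distance in elements.
// Error checking is omitted to keep the sketch short.
double average_load_cycles(size_t n, size_t stride, int iters)
{
    unsigned int *h_chain = (unsigned int *)malloc(n * sizeof(unsigned int));
    for (size_t i = 0; i < n; ++i)
        h_chain[i] = (unsigned int)((i + stride) % n);  // element i points to i + stride

    unsigned int *d_chain = NULL, *d_sink = NULL;
    long long *d_cycles = NULL;
    cudaMalloc(&d_chain, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_chain, h_chain, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_chain, iters, d_cycles, d_sink);  // one block, one thread

    long long cycles = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(d_chain);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    free(h_chain);
    return (double)cycles / iters;
}

int main()
{
    // Assumed parameters: 64 MB of 4-byte elements, hops of 1024 elements (4 KB).
    double c = average_load_cycles((size_t)16 << 20, 1024, 100000);
    printf("about %.1f cycles per dependent load\n", c);
    return 0;
}
```

Sweeping the size and stride arguments and plotting the result is how you end up with a curve instead of a single figure.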
If you want to measure throughput/bandwidth, that’s an entirely different issue and pretty much independent of latency.
Read vvolkov’s paper, it’s got a ton of great detail.