In this case, only one thread will be performing a read. Will this read
still take 400 to 600 clock cycles, or will it be much faster? (I ask this
question because sometimes the very first or last thread in a block
may need an extra variable from global memory.)
The answer is no - it will not be much faster. In fact, the two access patterns should not differ much in the number of clock cycles they require.
The way to see this is the other way round: a single memory read already takes a long time. Coalesced memory reads (subsequent threads accessing subsequent memory addresses) do not add significant latency on top of that.
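To make the distinction concrete, here is a minimal sketch (kernel and array names are made up for illustration) of a coalesced access pattern next to a strided one:

```cuda
// Coalesced: consecutive threads read consecutive addresses, so the
// reads of a half-warp combine into a single memory transaction.
__global__ void coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads read addresses 16 floats apart, so the
// accesses cannot be combined and each read may become its own transaction.
__global__ void strided(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 16];
}
```

Both kernels issue the same number of load instructions; only the addresses differ, and that is what determines the number of memory transactions.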
Consider reading chapter 5.1.2.1 in the programming guide for more information on global memory access.
You could consider constant memory for those values (if they are read-only), or store them in shared memory if you computed them earlier in the same kernel/block.
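As a sketch of both suggestions (all names here are hypothetical, not from the original poster's code):

```cuda
// Constant memory: read-only values, cached on chip. Filled from the
// host with cudaMemcpyToSymbol before the kernel launch.
__constant__ float edge_vals[2];

__global__ void kernel(float *out, const float *in)
{
    // Shared memory alternative: one thread loads the extra value,
    // then the whole block reuses it.
    __shared__ float extra;
    if (threadIdx.x == 0)
        extra = in[blockIdx.x];   // single read per block, not per thread
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + extra + edge_vals[0];
}
```

Either way, the point is that the odd "extra variable" read is taken out of the coalesced global-memory access path.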
It should, yes. But CUDA is able to optimize this memory access because the memory is accessed sequentially (coalesced memory access - I referenced the relevant chapter in the programming guide in my previous post).
Well, your original example only used 32 threads (as indicated by the array sizes). If you had run more than 32 threads it would have crashed. The way to think about coalescing is that it combines all accesses within a group of threads (a warp) into a single memory transaction (assuming the addresses satisfy coalescing conditions, which yours do).
On G8X coalescing happens at half-warp granularity, so your code without the if will take about twice as long to execute (two memory transactions) as the version with the if (one memory transaction).
Note that changing the if condition to if (threadIdx.x < 16) will likely result in the same performance as if (threadIdx.x == 0) in your example.
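For reference, a minimal sketch of the kind of kernel being discussed (the original poster's code is not shown in this thread, so the names are made up):

```cuda
__global__ void copy32(float *out, const float *in)
{
    // With this guard, only the first half-warp (threads 0-15) issues
    // reads: one memory transaction on G8X hardware.
    if (threadIdx.x < 16)
        out[threadIdx.x] = in[threadIdx.x];

    // Without the guard, all 32 threads read: two half-warps, hence
    // two transactions, and roughly twice the memory time.
}
```

Since one transaction is issued either way, `threadIdx.x < 16` and `threadIdx.x == 0` should cost about the same.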
This is a good question. I have been pondering this as well. Somehow I assumed that there would be some streaming operation over the bus and the 4 floats would be pumped in fast (just like how a load-multiple works on PowerPC)… My assumption…
OR perhaps
sixteen 8-bit loads may be coalesced into one 128-bit load instruction. Maybe that's why the half-warp size is 16… But I am surely missing something here. The global read coalescing should work for floats and ints as well…
This is only true if your data access pattern is coalesced. It is the same as with a hard disk: a seek operation takes long, but reading a chunk of consecutive data is fast.
I remember reading someone from NVIDIA say that uncoalesced reads are serialized. That would indeed mean the total time is doubled. However, my memory is not that reliable.
But maybe it is easier to try than to guess? ;) You can always write a small test kernel to check these things. Still, I think it is best to keep everything nicely coalesced. Otherwise, use textures: I have also read that reading data through a texture is faster than reading uncoalesced from global memory.
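A small test kernel of that kind could be timed with CUDA events. This is only a sketch (sizes and stride are arbitrary, and real measurements should average several launches):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];              // consecutive addresses
}

__global__ void uncoalesced(float *out, const float *in, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];  // scattered addresses
}

int main()
{
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    cudaEventRecord(t0);
    coalesced<<<n / 256, 256>>>(out, in, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("coalesced:   %f ms\n", ms);

    cudaEventRecord(t0);
    uncoalesced<<<n / 256, 256>>>(out, in, n, 16);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("uncoalesced: %f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Running both kernels on the same data makes the cost of the access pattern visible directly, without guessing at clock-cycle counts.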