Does the GPU have to wait around for writes to global memory to complete? If not, how many of these memory transactions can be in flight at a given point in time before it starts blocking? And what are the implications for parallelism if, for a while, the only thing going on in a thread block is shared-memory computation and relatively scattered writes?
Here’s some more background: I’m trying to decide how to map a domain-decomposing PDE solver onto CUDA. While I should be able to keep a good number of threads busy with math, the threads that end up working on subdomain boundary faces would need to write to very scattered locations instead of doing math. Can I do that without forcing the busy math threads to wait?
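To make that concrete, here is a rough sketch of the kind of kernel I have in mind. Everything in it is hypothetical and only for illustration (the names `solve_subdomain`, `tile`, `boundary_map`, the iteration count, and the stand-in arithmetic); the real stencil math and the `__syncthreads()` it would need are omitted:

```
__global__ void solve_subdomain(float *global_out, const int *boundary_map,
                                int n_interior, int n_boundary, int n_iter)
{
    // Per-block subdomain tile, assumed to have been filled from global
    // memory earlier in the kernel (that part is omitted here).
    extern __shared__ float tile[];

    int tid = threadIdx.x;

    if (tid >= n_interior && tid < n_interior + n_boundary) {
        // Boundary-face threads: hardly any math, just a scattered store
        // to global memory. The question is whether this store can be
        // fired off and completed in the background, or whether it makes
        // anything wait.
        int b = tid - n_interior;
        global_out[boundary_map[b]] = tile[tid];
    }

    if (tid < n_interior) {
        // "Busy" threads: work only on shared memory and registers.
        float v = tile[tid];
        for (int it = 0; it < n_iter; ++it)
            v = 0.5f * (v + 1.0f / (1.0f + v * v));  // stand-in for the PDE math
        tile[tid] = v;
    }
}
```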
If your kernel’s per-thread usage of shared memory and registers is low enough, the scheduler can time-slice multiple blocks on the same multiprocessor at once (as long as you launch more blocks than there are multiprocessors, of course). This is the easiest way to hide memory latency.
There are 8192 registers and 16 kB of shared memory per multiprocessor. Pass --ptxas-options=-v to nvcc, and it will print out how much your kernel uses.
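As a rough illustration of the arithmetic (the per-kernel numbers below are hypothetical; substitute whatever ptxas reports for your kernel, and note that other per-multiprocessor limits, such as the maximum number of resident threads and blocks, can also cap the result):

```
#include <stdio.h>

int main(void)
{
    // Per-multiprocessor limits quoted above.
    const int regs_per_sm = 8192;
    const int smem_per_sm = 16 * 1024;   // bytes

    // Hypothetical per-kernel figures, as reported by nvcc --ptxas-options=-v.
    const int threads_per_block = 128;
    const int regs_per_thread   = 14;
    const int smem_per_block    = 2048;

    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_sm / smem_per_block;

    int blocks_per_sm = by_regs < by_smem ? by_regs : by_smem;
    printf("Blocks resident per multiprocessor: %d\n", blocks_per_sm);
    return 0;
}
```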
So I take that as a “yes”, right? “Yes” as in: the thread in question stalls until the write is complete? For a read, I can see that the processor has to wait until the data becomes available. But a write could potentially (given the right circuitry) be fired off and left to complete on its own.
Also, how do memory stalls and warp divergence interact? Suppose only part of a warp stalls on a memory read: does the other part keep running, or does it stall as well?
Hmm, my understanding was that once they’ve diverged, different parts of the warp execute different code paths. So suppose only one part of the warp executes a gmem read and the other part doesn’t: does this read affect the non-reading part of the warp?
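For concreteness, here’s a toy kernel of the kind I mean (the names and the split are made up; I’m assuming a warp size of 32, so bit 16 of threadIdx.x selects a half-warp):

```
__global__ void divergent_read(const float *gmem, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v;

    if ((threadIdx.x & 16) == 0) {
        // Lower half of each warp: issues a global-memory read.
        v = gmem[tid];
    } else {
        // Upper half of each warp: no global access, only register math.
        v = 2.0f * (float)threadIdx.x;
    }

    out[tid] = v;
}
```

Does the half-warp in the else branch have to wait for the gmem read issued by the other half?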