Does the GPU have to wait around for writes to global memory to complete? If not, how many of these memory transactions can be in flight at a given time before a thread starts blocking? And what are the implications for parallelism if, for a while, the only thing going on in a thread block is shared-memory computation plus relatively scattered writes?
Here’s some more background: I’m trying to decide how to map a domain-decomposing PDE solver onto CUDA. While I should be able to keep a good number of threads busy with math, threads that end up working on subdomain boundary faces would need to write to very scattered locations instead of doing math. Can I do that without forcing the busy math threads to wait?
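To make the pattern concrete, here’s a minimal sketch of what I have in mind (all names here — `subdomain_step`, `scatter_idx`, `n_boundary` — are hypothetical, not my actual solver). Boundary threads issue scattered global stores while interior threads keep iterating on shared memory:

```cuda
#define BLOCK_SIZE 256

// Hypothetical sketch of the pattern in question. Interior threads do
// shared-memory math; boundary threads issue scattered, uncoalesced
// global stores.
__global__ void subdomain_step(float *global_field, const int *scatter_idx,
                               int n_boundary)
{
    __shared__ float tile[BLOCK_SIZE];

    int tid = threadIdx.x;
    tile[tid] = global_field[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    if (tid < n_boundary) {
        // Scattered store: as I understand it, a global store by itself
        // does not stall the issuing warp; the warp only blocks when the
        // memory pipeline's in-flight capacity is exhausted, or at a
        // fence/sync that orders the write.
        global_field[scatter_idx[tid]] = tile[tid];
    } else {
        // Interior threads: pure shared-memory arithmetic, independent of
        // the stores above.
        tile[tid] = 0.5f * (tile[tid] + tile[(tid + 1) % BLOCK_SIZE]);
    }
    __syncthreads();
}
```

One thing I’m already wondering about with this layout: if boundary and interior threads share a warp, the two branches serialize due to divergence, so it presumably makes sense to assign boundary faces to whole warps where possible.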