Does the GPU have to wait around for writes to global memory to complete? If not, how many of these memory transactions can be in flight at a given point in time before it starts blocking? And what are the implications for parallelism if, for a while, the only thing going on in a thread block is shared-memory computation and relatively scattered writes?
Here’s some more background: I’m trying to decide how to map a domain-decomposing PDE solver onto CUDA. While I should be able to keep a good number of threads busy with math, the threads that end up working on subdomain boundary faces would need to write to very scattered locations instead of doing math. Can I do that without forcing the busy math threads to wait?
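To make that concrete, here is a rough sketch of the kind of kernel I have in mind. Everything in it is hypothetical and only for illustration (the names `solve_subdomain`, `tile`, `boundary_map`, the iteration count, and the stand-in arithmetic); the real stencil math and the `__syncthreads()` it would need are omitted:

```
__global__ void solve_subdomain(float *global_out, const int *boundary_map,
                                int n_interior, int n_boundary, int n_iter)
{
    // Per-block subdomain tile, assumed to have been filled from global
    // memory earlier in the kernel (that part is omitted here).
    extern __shared__ float tile[];

    int tid = threadIdx.x;

    if (tid >= n_interior && tid < n_interior + n_boundary) {
        // Boundary-face threads: hardly any math, just a scattered store
        // to global memory. The question is whether this store can be
        // fired off and completed in the background, or whether it makes
        // anything wait.
        int b = tid - n_interior;
        global_out[boundary_map[b]] = tile[tid];
    }

    if (tid < n_interior) {
        // "Busy" threads: work only on shared memory and registers.
        float v = tile[tid];
        for (int it = 0; it < n_iter; ++it)
            v = 0.5f * (v + 1.0f / (1.0f + v * v));  // stand-in for the PDE math
        tile[tid] = v;
    }
}
```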
If your kernel’s per-thread usage of shared memory and registers is low enough, the scheduler can time-slice multiple blocks on the same multiprocessor at once (as long as you launch more blocks than there are multiprocessors, of course). This is the easiest way to hide memory latency.
There are 8192 registers and 16 kB of shared memory per multiprocessor. Pass --ptxas-options=-v to nvcc, and it will print out how much your kernel uses.
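As a rough illustration of the arithmetic (the per-kernel numbers below are hypothetical; substitute whatever ptxas reports for your kernel, and note that other per-multiprocessor limits, such as the maximum number of resident threads and blocks, can also cap the result):

```
#include <stdio.h>

int main(void)
{
    // Per-multiprocessor limits quoted above.
    const int regs_per_sm = 8192;
    const int smem_per_sm = 16 * 1024;   // bytes

    // Hypothetical per-kernel figures, as reported by nvcc --ptxas-options=-v.
    const int threads_per_block = 128;
    const int regs_per_thread   = 14;
    const int smem_per_block    = 2048;

    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_sm / smem_per_block;

    int blocks_per_sm = by_regs < by_smem ? by_regs : by_smem;
    printf("Blocks resident per multiprocessor: %d\n", blocks_per_sm);
    return 0;
}
```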
So I take that as a “yes”, right? “Yes” as in: the thread in question stalls until the write is complete? For a read, I can see that the processor has to wait until the data becomes available. But a write could potentially (given the right circuitry) be fired off and left to complete on its own.
Also, how do memory stalls and warp divergence interact? Suppose only part of a warp stalls on a memory read: does the other part keep running, or does it stall as well?
Hmm, my understanding was that once they’ve diverged, different parts of the warp execute different code paths. So suppose only one part of the warp executes a gmem read and the other part doesn’t: does this read affect the non-reading part of the warp?
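For concreteness, here’s a toy kernel of the kind I mean (the names and the split are made up; I’m assuming a warp size of 32, so bit 16 of threadIdx.x selects a half-warp):

```
__global__ void divergent_read(const float *gmem, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v;

    if ((threadIdx.x & 16) == 0) {
        // Lower half of each warp: issues a global-memory read.
        v = gmem[tid];
    } else {
        // Upper half of each warp: no global access, only register math.
        v = 2.0f * (float)threadIdx.x;
    }

    out[tid] = v;
}
```

Does the half-warp in the else branch have to wait for the gmem read issued by the other half?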