Parallel Access to GDU Global Memory

As part of the learning curve for CUDA, there is something that has been bothering me for a while now. Let me explain:

For the sake of example, let’s say that I have a Grid of 32 blocks, each Block with 512 threads. That is a total of 16384 threads in the Grid. In addition, assume that my kernel code is such that at some point, each thread will be staging the read of some data from Global to Shared memory.
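To make that concrete, here is roughly the kind of staging I have in mind (a sketch only; the array name and tile size are placeholders):

__global__ void stageAndCompute(const float *gData)
{
    __shared__ float tile[512];              // one element per thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = gData[i];            // each thread stages one read: Global -> Shared
    __syncthreads();                         // wait until the whole tile has arrived

    // ... floating point work on tile[] would follow here ...
}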

I think that I understand the concept of having some Warps running floating point calculations while others wait for their Global memory accesses to complete.

Here is the part that I don’t understand. Let’s say that half the threads, around 8000 of them, have requested data from the DRAM. Because of things like CAS, RAS and other timings for DDR2/DDR3 memory, how can the DRAM be set up to service 8000 separate memory accesses? That is, will the 8000 accesses, assuming that they are started at about the same time, all take just 200 cycles or so? How can the DRAM provide for that?

It works differently :)

Let’s take the 8800 GTX as an example

You have 16 multiprocessors.
Each multiprocessor can have at most 768 threads ‘running’ on it.
So you can have at most 16x768 = 12288 threads ‘running’ at a single moment.

In your example it would be even less. You would have only 512 threads ‘running’ on a multiprocessor at a time, since blocks cannot be spread over multiprocessors. So you would have 16x512 = 8192 threads running at a time. Some of those threads are waiting for memory to arrive, others are doing calculations. So for your example you would have 16 blocks running to completion, after which the remaining 16 blocks run to completion.

Splitting it up into 64 blocks of 256 threads would mean you can have 3 blocks on a multiprocessor, so 16x3 = 48 blocks running at once, after which only 16 blocks have to finish. So it is quite possible that having fewer threads per block can speed up your calculations (which for me was quite counter-intuitive).

All of this is, by the way, heavily dependent on how many registers and how much shared memory your kernel requires. That can limit the number of threads per block, and the number of blocks per multiprocessor at a single moment. You can play a bit with the occupancy calculator to get a feeling for this.
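To make the two configurations concrete, the only thing that changes is the launch configuration (the kernel name and argument are placeholders):

// 16384 threads total, partitioned two ways
myKernel<<<32, 512>>>(d_data);   // one 512-thread block fits on a multiprocessor at a time
myKernel<<<64, 256>>>(d_data);   // three 256-thread blocks (768 threads) fit on a multiprocessor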

Denis, thanks for the explanation.

My question is a little different, I think :)

For the sake of a simple example, let’s say that we have 16 blocks, each with 32 threads, for a total of 512 threads in the kernel. In addition, let’s say that the kernel code is such that all 16 Warps hit an instruction to stage a read from Global memory to Shared memory. At that point, all 16 Warps are waiting for the data to come back from Global memory. No other processing can occur until we read the data into shared memory.

The hypothetical question for this example, then, is: if we have 512 reads pending from memory, what is the limit on Global memory access? It seems to me that not all 512 reads could complete in 200 GDU clock cycles.

Is there some maximum number X of Global memory thread reads per 200-cycle period specified for the G80?

Denis,

Maybe this is what you were trying to explain. Let’s say that for a given GDU, the latency when accessing Global memory is 200 clock cycles. Is the 200 clock cycles a specification that applies per Warp? That is, if 16 Warps are waiting for access to Global memory, is there a queue of Warp-to-Global-memory accesses such that after 200 clock cycles some Warp in the queue gets its data into the GDU? Then another Warp is given access to Global memory, such that 200 cycles later that Warp gets its data?

Summary: Is the Global memory latency specified per Warp? If so, I wonder what happens if and when the size of a Warp in future processors is 16 or even 1?

The questions you ask now are to be answered by an NVIDIA engineer I think (or maybe it’s possible to find out by smart testing). I do not remember reading specs about it before.
Warp size is something I would expect to grow, not shrink. Only if they made the instruction unit run at a higher clock rate would I expect it to shrink, but I do not see the added value in that.

But basically, having 16 blocks of 32 threads does indeed give you no opportunity at all for hiding global memory latency if you have 16 multiprocessors. That is why more blocks with more threads are needed to get good performance :D

What is a GDU? Do you mean GPU?

Memory loads and stores are just like other instructions. They are executed one warp at a time. So you can’t have 8000 simultaneous loads.

If a load for a warp is coalesced, a single memory transaction is issued. If not, then multiple memory transactions are issued.
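Roughly speaking, a load is coalesced when consecutive threads of a warp read consecutive addresses. A sketch of the two cases (the kernel and array names are mine, not from any particular code):

__global__ void coalescedLoad(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                // thread k of the warp reads element k: one transaction per warp
}

__global__ void stridedLoad(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];       // reads scatter across memory: several transactions per warp
}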

Once a warp issues an instruction, usually control passes to another warp and it executes an instruction. In other words, processing proceeds across warps before proceeding down the instructions of a single warp. In this way pipeline and memory latency can be hidden. Once control returns to a warp, if it is not blocked waiting on data from a load or a barrier, then it issues its next instruction.

Memory instructions are no different. Warps proceed sequentially on the multiprocessor. After one warp issues its load, the next proceeds. When a warp is no longer blocked waiting for data control will eventually return to it and it will issue another instruction.

So take this code

0 MOV R0, 0x2
1 MOV R1, 0x4
2 LOAD R2, MEM
3 MUL R0, R0, R1
4 MAD R0, R1, R2, R0

If you have, say, 4 warps, then you’ll get:

0: WARP0(MOV), WARP1(MOV), WARP2(MOV), WARP3(MOV)
1: WARP0(MOV), WARP1(MOV), WARP2(MOV), WARP3(MOV)
2: WARP0(LOAD), WARP1(LOAD), WARP2(LOAD), WARP3(LOAD)
3: WARP0(MUL), WARP1(MUL), WARP2(MUL), WARP3(MUL)
STALL until first LOAD returns
4: WARP0(MAD), WARP1(MAD), WARP2(MAD), WARP3(MAD)

See, since the MUL is not dependent on the LOAD, the MUL can proceed without waiting on it. But once the MUL is completed for all warps, if the data for none of the LOADs is ready, then the warps all stall until one of them gets its data.

Note that during that stall, there may be warps from other thread blocks active on the multiprocessor that can proceed with their work. This is important in the case where you have

LOAD
BAR // __syncthreads();
MUL

Because now even if the MUL is independent of the LOAD, all threads in the block have to wait for the barrier (__syncthreads()). This means that you can only hide the memory latency if you have more than one active thread block per multiprocessor.
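In CUDA C terms, that pattern looks roughly like this (a sketch; the block size and names are arbitrary):

__global__ void barrierExample(const float *gData, float *out, float a, float b)
{
    __shared__ float s[256];

    s[threadIdx.x] = gData[threadIdx.x];   // LOAD: Global -> Shared
    __syncthreads();                       // BAR: every thread in the block waits here
    out[threadIdx.x] = a * b;              // MUL: independent of the load, yet still held up by the barrier
}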

Mark

Mark,

Thank you very much. An excellent explanation … Yes GPU and not GDU :)

It sounds like you are saying that loads from Global memory are issued per Warp. In other words, your entry after instruction 3 says “STALL until first LOAD returns”. If all 4 Warps are waiting on the LOAD from memory to complete at instruction #4, does the 200-cycle latency apply to just one of the Warps, or could all 4 Warps conceivably have their data ready 200 cycles after instruction 3?

Latency is per memory transaction, but it is pipelined. So if the load for all threads of each warp is coalesced into a single transaction, then the latency is about 400 cycles (not sure where you got 200) for the first one, and each subsequent one will arrive about every 4 cycles.

That means if

(#warps) * (# instructions after the load that don’t depend on it) * (cycles per instruction) >= memory latency

You will have no stall.
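For example, with 8 warps, 12 independent instructions after the load, and 4 cycles per instruction, you get 8 * 12 * 4 = 384 cycles of other work, slightly less than the ~400-cycle latency, so there is a short stall; with 16 independent instructions you get 8 * 16 * 4 = 512 >= 400 and the latency is completely hidden.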

Mark

Mark,

Thank you. All clear now … “Loads are pipelined: after the first Load in the pipeline, each additional Load (when 2 or more are in flight) takes about 4 cycles.”

Not sure where I got the 200 number either :">

On a related note, on which clock domain is the memory management unit running? When I overclock my G8800, I’ve found that memory-request-dominated programs speed up nearly linearly with the “core” clock frequency, but not at all with the “memory” clock frequency. I am under the impression that the shader clock stays the same regardless of these other two settings.

I think this implies that, in this test application, the bottleneck is the part of the GPU that queues up load instructions from the shaders and issues them to the RAM, and it appears that circuitry runs off of the “core” clock.

Thanks,

Paul