Waiting for global memory access.

Hi all.

I need your help to clarify my understanding of following statement from Progarmming Guide.

In my understanding this occurs in following way.

Let us have the following code:

The first 1/4 warp executes “do_something”, then starts to read from global memory.

While first 1/4 is wating for data from global memory “do_something” statements are executed by second 1/4 warp, when is starts to read from global memory

another 1/4 warp executes “do_something” and so on…

And I am wondering what is the actual “thread structure” (i.e. warp, 1/4 warp, block etc.) which can be executed while another is waiting for memory data.

I think the idea is that the ‘hiding instructions in memory latency’ happens when one multiprocessor can schedule multiple blocks at the same time. A pipeline-like thing will then happen as one block is reading from memory, while another is busy doing arithmetic.

The interleaving is not so much at the 1/4 warp level. It is at the warp and block level. To see how it works, you need to invert your example.

read a from global memory

b = f(a) // <-- only arithmetic instructions

write b to global memory

MANY of these warps can run concurrently on a multiprocessor. When all these are started up, they are all waiting on their “a” to be read in. As soon as the first “a” comes in, that warp starts executing the arithmetic instructions in the function f(). Other warps are still waiting for their “a”'s to be read in. Then the dance continues with many warps waiting for “a” while others are doing arithmetic. This is the interleaving the guide refers to. And the latency of the global memory is such that f() can contain 100’s of arithmetic instructions and still waste time waiting for memory to be read.

Try it yourself: setup a simple kernel (be sure to get full memory read coalescing) and write a kernel that multiplies every value by 2. Then write one with some complicated math function involving cosf or something costly. Both kernels should execute in the same total time. If you count GFLOP/s calculated it should be much larger for the 2nd kernel. If you count GB/s transferred, it should be the same in each.

MisterAnderson42, wumpus
Thank you.

To be honest I’ve noticed that effect in one of my kernels: each thread perfomed your pseudocode many times.

And when I scaled number of blocks in two times (and size of input data as a consequence) the kernel execution time didnot change much (about 10-20%).

How many blocks were you running? I’ve noticed this same effect when running fewer blocks than multiprocessors * number of blocks per multiprocessor (from the occupancy calculator). I.e fewer than 64 blocks.

While number of blocks is increasing from 8 to 32 in my task kernel execution time does not change much.

I used to do CTM programming before looking at Cuda. In CTM a thread can continue running after a texture request as long as the subsequent ALU operations are independent of the result of the texture operation. It is controlled by a semaphore mechanism.

When I first read the paragraph that Serge refers to in the Cuda manual I thought that something similar was going on in Cuda “behind the scenes”. But since the manual does mention the ‘thread scheduler’ specifically it might simply be a matter of switching warps?

I think someone from Nvidia needs to clarify. Please.



What exactly are your grid configurations (number of blocks x number of threads per block)? Also, what exactly does your kernel do?

I assume you call cudaThreadSynchronize() before starting the timer and just prior to stopping the timer?


64 threads x 8, 16, 24, 32, 48, 64 blocks

__global__ void

countj(SpectrumMatrix *dm, Parameters *dp, int *dN, float *dresult)


	float sum = 0;

	int idx = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y *blockDim.x+ threadIdx.x;

	SpectrumMatrix &m = dm[idx];

	Parameters *p = dp + idx * NP;

	for (int jf = 0; jf < NF - 1; ++jf)


  for (int jt = 0; jt < NT - 1; ++jt)


  	float f = m.f_grid[jf];

  	float t = m.t_grid[jt];

  	float tmp = (m.s[jf*m.nT + jt] - get_multi_sp_value_cu(f, t, p, dN[idx]));

  	tmp *= tmp;

  	sum += tmp * (m.f_grid[jf + 1] - m.f_grid[jf])*(m.t_grid[jt + 1] - m.t_grid[jt]);




	dresult[idx] = sqrtf(sum);

  // return sqrtf(sum);


where get_multi_sp_value_cu(f, t, p, dN[idx])) uses “heavy” operations like cosf, exp etc.

for sure. I also use CUDA profiler.

OK. So, I can see how the times from 8 to 16 blocks won’t go up - 8 blocks utilize half of the available multiprocessors (assuming there are 16). I would try larger blocks - 128 and above.

Also, I would guess that your reads into m are uncoalesced, unless the size for the SpectrumMatrix types is 4, 8, or 16 bytes.

What are the kernel times you’re seeing?


Paulius, thank you.

What are the kernel times you’re seeing?

NT*NB time, ms

64x4 57155.5

64x8 58171.2

64x16 64615.5

64x24 69473.5

64x32 72859

64x48 125325

64x64 132313

4 and 8 blocks are measured “for fun” :)

Of course, it’s not exactly kernel times. It is sum of all kernel times in experiment.

I.e. it is something like

time = 0;

for (...)






time += getTimerValue


Can you list the occupancies for each as well? Also, how many registers and smem is your kernel using (check the .cubin file). It appears that your times go up every time the number of blocks per multiprocessor goes up by 2.


firstly, sorry for 2-month delay :">

well, checkout the attached grath

X-axis: number of blocks (each block = 32 threads i.e. 1 warp)
Y-axis: kernel time in ms

The occupancy is 0.25.
Therefore max number of concurrent blocks on multiprocessor in my case is 0.25*24 = 6.

And the graphic “jumps” when number of blocks is multiple of 6.

So, in my understanding:

  1. blocks number per MP 1…6: kernel time firstly changes insignificant due to memory latency hiding.

  2. blocks number per MP == 7: limit is reached, only 6 blocks can run “concurrently”, the seventh block has to wait them

  3. 6…12. six blocks are runnig “concurrently” on MP, remainng are waiting for them. then six blocks finished, remaining blocks also run “concurrently”


Are my explanations correct?

Having 64 threads will NOT hide the Regiser-read-write latency and Register-memory-bank latency. You need atleast 192 threads. Please see “Registers” section under “Performance Guidelines” ( section , I guess)

Not only your global memory stalls your operation. Even if you are performing computations only in shared memory – you still have register read-write conflicts which need to be overlapped with computation. For this, you need at least 192 threads inside your block.

See the following thread entry: I got 3X improvement after I changed the number of threads to 192.

well the question is: do you mean 64 threads per block or per multiprocessor?


is it matter if we have 6 blocks * 1 warp per multiprocessor or 1 block * 6 warp?

(in case threads do not interact via shared mem/ of course)

and how would you explain the fact that kernel time doesnot change significantly while we’re enlarging number of blocks form 1 to 6 (see graph)?

BTW thanks for joining the discussion. :thumbup:

Your graph jumps at 96, 192 ,288. I dont understand why you would call them as multiple of “6”. May b, you were referring it at a single MP level.

I understand your explanation. But let me restate as how I understand. I think our explanations are going to be similar.

When you schedule 16 blocks, I would assume that the GPU would schedule them straight on 16 Multiprocessors. So, the turn around time that you see here is the base minimum turn around time required to execute just 1 block. This remains constant until 96. This is explainable as you can run 6 Blocks concurrently within an MP and it looks like the GPU is effectively overlapping GLOBAL Memory access and computation. It seems to me that your GLOBAL memory access is what is determining your block-turn-around time. The computation is very minimal. Hence, the time taken remains constant for 1 block as well as 6 blocks running on the same MP (with 96 blocks running concurrently). Thats the reason that I can think of.

Now, what happens from 96 to 192? When the initial 96 blocks are over, another 96 blocks are created (or re-used). Obviously, the time taken is going to increase. As you can see from the graph, it is doubled. The reason is very obvious.

I understand your question on having multiple blocks. Since your occupancy is 0.25, you are actully using 6 WARPs which equals to 192 threads. So, Thats good in a way. Anyway, your application looks to be constrained only by global memory access and NOT by computation. So, this latency may NOT matter at all.
I am NOT sure about the latencies of “Switching Blocks” when compared to “switching WARPS”. If your occupancy is NOT limited by “registers” then you can try increasing WARPs per block and see if it matters.

I mean that 96 = 616, i.e. 96 blocks gives us 6 blocks per MP. 192 blocks gives us 62 blocks per MP and so on. 6 is significant since occupancy is 0.25 and 0.25*24 = 6.

Thank you for detailed explanation of your understanding. Now I can see that it is almost the same as mine. :)

So, I am going to do some more tests in the beginning of the new year to determine the factors which affect kernel timing and its jumps.

192 threads/block are not always faster than smaller blocks. Many of my kernels have peak performance at 64 threads per block. The whole register read after write dependency is only one of the many competing interactions that can change performance, depending on your kernel it may not matter. The only way to be certain of the optimal block size is to benchmark all block sizes in multiples of 32.


Good Luck and Happy New year! Please keep us posted if you find out anything interesting.

Sarnath, thank you.

Happy New Year, comrade.

And Happy New Year to all the CUDA-community :magic:

I’ll certainly keep everyone here informed, in case I get useful results.