Waiting for global memory access.

serge · October 29, 2007, 10:46am

Hi all.

I need your help to clarify my understanding of following statement from Progarmming Guide.

In my understanding this occurs in following way.

Let us have the following code:

The first 1/4 warp executes “do_something”, then starts to read from global memory.

While first 1/4 is wating for data from global memory “do_something” statements are executed by second 1/4 warp, when is starts to read from global memory

another 1/4 warp executes “do_something” and so on…

And I am wondering what is the actual “thread structure” (i.e. warp, 1/4 warp, block etc.) which can be executed while another is waiting for memory data.

wumpus · October 29, 2007, 10:52am

I think the idea is that the ‘hiding instructions in memory latency’ happens when one multiprocessor can schedule multiple blocks at the same time. A pipeline-like thing will then happen as one block is reading from memory, while another is busy doing arithmetic.

MisterAnderson42 · October 29, 2007, 12:52pm

The interleaving is not so much at the 1/4 warp level. It is at the warp and block level. To see how it works, you need to invert your example.

read a from global memory

b = f(a) // <-- only arithmetic instructions

write b to global memory

MANY of these warps can run concurrently on a multiprocessor. When all these are started up, they are all waiting on their “a” to be read in. As soon as the first “a” comes in, that warp starts executing the arithmetic instructions in the function f(). Other warps are still waiting for their “a”'s to be read in. Then the dance continues with many warps waiting for “a” while others are doing arithmetic. This is the interleaving the guide refers to. And the latency of the global memory is such that f() can contain 100’s of arithmetic instructions and still waste time waiting for memory to be read.

Try it yourself: setup a simple kernel (be sure to get full memory read coalescing) and write a kernel that multiplies every value by 2. Then write one with some complicated math function involving cosf or something costly. Both kernels should execute in the same total time. If you count GFLOP/s calculated it should be much larger for the 2nd kernel. If you count GB/s transferred, it should be the same in each.

serge · October 29, 2007, 1:40pm

MisterAnderson42, wumpus
Thank you.

MisterAnderson42
To be honest I’ve noticed that effect in one of my kernels: each thread perfomed your pseudocode many times.

And when I scaled number of blocks in two times (and size of input data as a consequence) the kernel execution time didnot change much (about 10-20%).

MisterAnderson42 · October 29, 2007, 3:21pm

How many blocks were you running? I’ve noticed this same effect when running fewer blocks than multiprocessors * number of blocks per multiprocessor (from the occupancy calculator). I.e fewer than 64 blocks.

serge · October 29, 2007, 8:25pm

While number of blocks is increasing from 8 to 32 in my task kernel execution time does not change much.

tsan · October 29, 2007, 10:19pm

I used to do CTM programming before looking at Cuda. In CTM a thread can continue running after a texture request as long as the subsequent ALU operations are independent of the result of the texture operation. It is controlled by a semaphore mechanism.

When I first read the paragraph that Serge refers to in the Cuda manual I thought that something similar was going on in Cuda “behind the scenes”. But since the manual does mention the ‘thread scheduler’ specifically it might simply be a matter of switching warps?

I think someone from Nvidia needs to clarify. Please.

/Thomas

paulius · October 29, 2007, 10:22pm

Serge,

What exactly are your grid configurations (number of blocks x number of threads per block)? Also, what exactly does your kernel do?

I assume you call cudaThreadSynchronize() before starting the timer and just prior to stopping the timer?

Paulius

serge · October 29, 2007, 11:22pm

64 threads x 8, 16, 24, 32, 48, 64 blocks

__global__ void

countj(SpectrumMatrix *dm, Parameters *dp, int *dN, float *dresult)

{

	float sum = 0;

	int idx = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y *blockDim.x+ threadIdx.x;

	SpectrumMatrix &m = dm[idx];

	Parameters *p = dp + idx * NP;

	for (int jf = 0; jf < NF - 1; ++jf)

	{

  for (int jt = 0; jt < NT - 1; ++jt)

  {

  	float f = m.f_grid[jf];

  	float t = m.t_grid[jt];

  	float tmp = (m.s[jf*m.nT + jt] - get_multi_sp_value_cu(f, t, p, dN[idx]));

  	tmp *= tmp;

  	sum += tmp * (m.f_grid[jf + 1] - m.f_grid[jf])*(m.t_grid[jt + 1] - m.t_grid[jt]);

  }

	}

	

	dresult[idx] = sqrtf(sum);

  // return sqrtf(sum);

}

where get_multi_sp_value_cu(f, t, p, dN[idx])) uses “heavy” operations like cosf, exp etc.

for sure. I also use CUDA profiler.

paulius · October 30, 2007, 1:28am

OK. So, I can see how the times from 8 to 16 blocks won’t go up - 8 blocks utilize half of the available multiprocessors (assuming there are 16). I would try larger blocks - 128 and above.

Also, I would guess that your reads into m are uncoalesced, unless the size for the SpectrumMatrix types is 4, 8, or 16 bytes.

What are the kernel times you’re seeing?

Paulius

serge · October 31, 2007, 8:30am

Paulius, thank you.

What are the kernel times you’re seeing?

NT*NB time, ms

64x4 57155.5

64x8 58171.2

64x16 64615.5

64x24 69473.5

64x32 72859

64x48 125325

64x64 132313

4 and 8 blocks are measured “for fun” :)

Of course, it’s not exactly kernel times. It is sum of all kernel times in experiment.

I.e. it is something like

time = 0;

for (...)

{

resetTimer

startTimer

kernel

stopTimer

time += getTimerValue

}

paulius · October 31, 2007, 5:31pm

Can you list the occupancies for each as well? Also, how many registers and smem is your kernel using (check the .cubin file). It appears that your times go up every time the number of blocks per multiprocessor goes up by 2.

Paulius

serge · December 26, 2007, 3:21pm

paulius
firstly, sorry for 2-month delay :">

well, checkout the attached grath

X-axis: number of blocks (each block = 32 threads i.e. 1 warp)
Y-axis: kernel time in ms

The occupancy is 0.25.
Therefore max number of concurrent blocks on multiprocessor in my case is 0.25*24 = 6.

And the graphic “jumps” when number of blocks is multiple of 6.

So, in my understanding:

blocks number per MP 1…6: kernel time firstly changes insignificant due to memory latency hiding.
blocks number per MP == 7: limit is reached, only 6 blocks can run “concurrently”, the seventh block has to wait them
6…12. six blocks are runnig “concurrently” on MP, remainng are waiting for them. then six blocks finished, remaining blocks also run “concurrently”

etc.

Are my explanations correct?

Sarnath · December 27, 2007, 4:06am

Having 64 threads will NOT hide the Regiser-read-write latency and Register-memory-bank latency. You need atleast 192 threads. Please see “Registers” section under “Performance Guidelines” ( section 5.1.2.5 , I guess)

Not only your global memory stalls your operation. Even if you are performing computations only in shared memory – you still have register read-write conflicts which need to be overlapped with computation. For this, you need at least 192 threads inside your block.

See the following thread entry: I got 3X improvement after I changed the number of threads to 192.
“The Official NVIDIA Forums | NVIDIA”

serge · December 27, 2007, 7:33am

well the question is: do you mean 64 threads per block or per multiprocessor?

i.e.

is it matter if we have 6 blocks * 1 warp per multiprocessor or 1 block * 6 warp?

(in case threads do not interact via shared mem/ of course)

and how would you explain the fact that kernel time doesnot change significantly while we’re enlarging number of blocks form 1 to 6 (see graph)?

BTW thanks for joining the discussion. External Media

Sarnath · December 27, 2007, 11:31am

Your graph jumps at 96, 192 ,288. I dont understand why you would call them as multiple of “6”. May b, you were referring it at a single MP level.

I understand your explanation. But let me restate as how I understand. I think our explanations are going to be similar.

When you schedule 16 blocks, I would assume that the GPU would schedule them straight on 16 Multiprocessors. So, the turn around time that you see here is the base minimum turn around time required to execute just 1 block. This remains constant until 96. This is explainable as you can run 6 Blocks concurrently within an MP and it looks like the GPU is effectively overlapping GLOBAL Memory access and computation. It seems to me that your GLOBAL memory access is what is determining your block-turn-around time. The computation is very minimal. Hence, the time taken remains constant for 1 block as well as 6 blocks running on the same MP (with 96 blocks running concurrently). Thats the reason that I can think of.

Now, what happens from 96 to 192? When the initial 96 blocks are over, another 96 blocks are created (or re-used). Obviously, the time taken is going to increase. As you can see from the graph, it is doubled. The reason is very obvious.

I understand your question on having multiple blocks. Since your occupancy is 0.25, you are actully using 6 WARPs which equals to 192 threads. So, Thats good in a way. Anyway, your application looks to be constrained only by global memory access and NOT by computation. So, this latency may NOT matter at all.
And,
I am NOT sure about the latencies of “Switching Blocks” when compared to “switching WARPS”. If your occupancy is NOT limited by “registers” then you can try increasing WARPs per block and see if it matters.

serge · December 27, 2007, 1:42pm

I mean that 96 = 616, i.e. 96 blocks gives us 6 blocks per MP. 192 blocks gives us 62 blocks per MP and so on. 6 is significant since occupancy is 0.25 and 0.25*24 = 6.

Thank you for detailed explanation of your understanding. Now I can see that it is almost the same as mine. :)

So, I am going to do some more tests in the beginning of the new year to determine the factors which affect kernel timing and its jumps.

MisterAnderson42 · December 27, 2007, 2:42pm

192 threads/block are not always faster than smaller blocks. Many of my kernels have peak performance at 64 threads per block. The whole register read after write dependency is only one of the many competing interactions that can change performance, depending on your kernel it may not matter. The only way to be certain of the optimal block size is to benchmark all block sizes in multiples of 32.

Sarnath · December 27, 2007, 3:08pm

Serge,

Good Luck and Happy New year! Please keep us posted if you find out anything interesting.

serge · December 27, 2007, 3:39pm

Sarnath, thank you.

Happy New Year, comrade.

And Happy New Year to all the CUDA-community :magic:

I’ll certainly keep everyone here informed, in case I get useful results.

Topic		Replies	Views
Warp switching does anybody understands the mechanism CUDA Programming and Performance	16	8482	March 28, 2008
Deep understanding how block is actually processed in MP CUDA Programming and Performance	28	30286	December 15, 2010
Global Memoy latencies and NVIDIA cards Latency CUDA Programming and Performance	15	8849	January 11, 2008
Using Shared Memory in CUDA C/C++ Technical Blog	36	1969	October 8, 2020
CUDA Occupancy Calculator Helps pick optimal thread block size CUDA Programming and Performance	76	312101	September 13, 2011
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3309	January 10, 2010
Doubts related to CUDA CUDA Programming and Performance	17	11805	November 18, 2010
Discrepancy between theoretical occupancy and achieved occupancy depending on ThreadsPerBlock CUDA Programming and Performance cuda	7	112	September 6, 2024
Global thread barrier CUDA Programming and Performance	78	85630	December 23, 2011
Efficient use of shared memory CUDA Programming and Performance	29	4378	December 2, 2019

Waiting for global memory access.

Related topics