I need your help to clarify my understanding of the following statement from the Programming Guide.
In my understanding this occurs in the following way.
Let us have the following code:
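The code block itself appears to have been lost from the post. Judging from the description that follows, a minimal sketch of the kind of kernel being discussed might look like this (do_something, g_in, and g_out are my own placeholder names, not from the original):

```cuda
// Hypothetical reconstruction of the kernel the post refers to:
// some arithmetic first, then a long-latency global memory read.
__device__ float do_something(int i)
{
    return sinf((float)i);          // arithmetic independent of global memory
}

__global__ void example_kernel(const float *g_in, float *g_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = do_something(i);      // compute first...
    float a = g_in[i];              // ...then read from global memory (hundreds of cycles)
    g_out[i] = a + x;               // use the value once it arrives
}
```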
The first 1/4 warp executes “do_something”, then starts to read from global memory.
While the first 1/4 warp is waiting for data from global memory, the “do_something” statements are executed by the second 1/4 warp; when it starts to read from global memory,
another 1/4 warp executes “do_something”, and so on…
And I am wondering what the actual “thread structure” is (i.e. warp, 1/4 warp, block etc.) that can be executed while another one is waiting for memory data.
I think the idea is that the ‘hiding of memory latency behind instructions’ happens when one multiprocessor can schedule multiple blocks at the same time. A pipeline-like thing then happens: one block is reading from memory while another is busy doing arithmetic.
The interleaving is not so much at the 1/4 warp level. It is at the warp and block level. To see how it works, you need to invert your example.
read a from global memory
b = f(a) // <-- only arithmetic instructions
write b to global memory
MANY of these warps can run concurrently on a multiprocessor. When they are all started up, they are all waiting for their “a” to be read in. As soon as the first “a” comes in, that warp starts executing the arithmetic instructions in the function f(). Other warps are still waiting for their “a”s to be read in. Then the dance continues, with many warps waiting for “a” while others do arithmetic. This is the interleaving the guide refers to. And the latency of global memory is such that f() can contain hundreds of arithmetic instructions and still waste time waiting for memory to be read.
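In CUDA C the inverted three-line example above might look like this (the function f and the buffer names are placeholders of mine):

```cuda
// One warp per 32 consecutive threads; while one warp stalls on its load,
// the scheduler issues arithmetic for warps whose data has already arrived.
__global__ void inverted_example(const float *g_a, float *g_b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = g_a[i];        // read a from global memory (the long-latency step)
    float b = a * a + 1.0f;  // b = f(a): arithmetic-only instructions
    g_b[i] = b;              // write b back to global memory
}
```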
Try it yourself: set up a simple test (be sure to get full memory read coalescing) and write a kernel that multiplies every value by 2. Then write one with some complicated math function involving cosf or something costly. Both kernels should execute in the same total time. If you count GFLOP/s calculated, it should be much larger for the 2nd kernel. If you count GB/s transferred, it should be the same in each.
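A minimal version of that experiment might look like this (kernel names, the loop count, and the exact math are arbitrary choices of mine, not from the post):

```cuda
// Cheap kernel: one multiply per element. Memory-bound.
__global__ void times2(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

// Expensive kernel: identical memory traffic, far more arithmetic.
__global__ void costly(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];
    for (int k = 0; k < 64; ++k)    // arbitrary amount of extra math
        x = cosf(x) + 2.0f * x;
    out[i] = x;
}
```

If both kernels run over the same coalesced array and their timings match, all of the extra cosf work was hidden behind the memory latency.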
How many blocks were you running? I’ve noticed this same effect when running fewer blocks than multiprocessors * number of blocks per multiprocessor (from the occupancy calculator), i.e. fewer than 64 blocks.
I used to do CTM programming before looking at Cuda. In CTM a thread can continue running after a texture request as long as the subsequent ALU operations are independent of the result of the texture operation. It is controlled by a semaphore mechanism.
When I first read the paragraph that Serge refers to in the Cuda manual I thought that something similar was going on in Cuda “behind the scenes”. But since the manual does mention the ‘thread scheduler’ specifically it might simply be a matter of switching warps?
I think someone from Nvidia needs to clarify. Please.
OK. So, I can see how the times from 8 to 16 blocks won’t go up: 8 blocks utilize half of the available multiprocessors (assuming there are 16). I would try larger blocks, 128 threads and above.
Also, I would guess that your reads into m are uncoalesced, unless the size for the SpectrumMatrix types is 4, 8, or 16 bytes.
Can you list the occupancies for each as well? Also, how many registers and smem is your kernel using (check the .cubin file). It appears that your times go up every time the number of blocks per multiprocessor goes up by 2.
Having 64 threads will NOT hide the register read-after-write latency and register memory-bank latency. You need at least 192 threads. Please see the “Registers” section under “Performance Guidelines” (section 5.1.2.5, I guess).
It is not only global memory that stalls your operation. Even if you are performing computations only in shared memory, you still have register read-write conflicts that need to be overlapped with computation. For this, you need at least 192 threads inside your block.
Your graph jumps at 96, 192, 288. I don’t understand why you would call them multiples of “6”. Maybe you were referring to it at the single-MP level.
I understand your explanation. But let me restate it as I understand it. I think our explanations are going to be similar.
When you schedule 16 blocks, I would assume that the GPU schedules them straight onto 16 multiprocessors. So the turnaround time that you see here is the base minimum turnaround time required to execute just 1 block. This remains constant until 96 blocks. This is explainable: you can run 6 blocks concurrently within an MP, and it looks like the GPU is effectively overlapping global memory access and computation. It seems to me that your global memory access is what determines your block turnaround time. The computation is very minimal. Hence, the time taken remains constant for 1 block as well as for 6 blocks running on the same MP (with 96 blocks running concurrently). That’s the reason that I can think of.
Now, what happens from 96 to 192? When the initial 96 blocks are over, another 96 blocks are created (or re-used). Obviously, the time taken is going to increase. As you can see from the graph, it doubles. The reason is obvious.
I understand your question on having multiple blocks. Since your occupancy is 0.25, you are actually using 6 warps, which equals 192 threads. So that’s good in a way. Anyway, your application looks to be constrained only by global memory access and NOT by computation, so this latency may NOT matter at all.
And I am NOT sure about the latencies of “switching blocks” compared to “switching warps”. If your occupancy is NOT limited by registers, then you can try increasing warps per block and see if it matters.
I mean that 96 = 6*16, i.e. 96 blocks gives us 6 blocks per MP. 192 blocks gives us 6*2 blocks per MP and so on. 6 is significant since occupancy is 0.25 and 0.25*24 = 6.
Thank you for detailed explanation of your understanding. Now I can see that it is almost the same as mine. :)
So, I am going to do some more tests in the beginning of the new year to determine the factors which affect kernel timing and its jumps.
192 threads/block are not always faster than smaller blocks. Many of my kernels have peak performance at 64 threads per block. The whole register read-after-write dependency is only one of many competing interactions that can change performance; depending on your kernel it may not matter. The only way to be certain of the optimal block size is to benchmark all block sizes in multiples of 32.