We tried an experiment to evaluate scalability of our CUDA program. We’re using a Quadro FX 5800, with 240 cores.
Run a kernel with grid size 1 and block size 1. This would be expected to run on a single core of the GPU.
Repeat the experiment, with different grid and block sizes. When we run with sufficiently large grid and block sizes, we can expect a max. speed-up of 240 (since there are only 240 cores).
In particular, if the block size stays at 1, and the grid size is increased, then each SM will only use 1 core (out of 8) at a time. Here, the max speed up that can be achieved is 30 (since there are 30 SMs).
However, we’re seeing that:
a ) A kernel <<<240,1>>> runs 180 times faster than the kernel <<<1,1>>>.
b ) A kernel <<<200,1>>> runs 157 times faster than the kernel <<<1,1>>>.
c ) A kernel <<<100,1>>> runs 90 times faster than the kernel <<<1,1>>>.
d ) A kernel <<<10,1>>> runs 9 times faster than the kernel <<<1,1>>>.
So the question really is:
When a SM executes a half warp, can this half-warp consist of threads from multiple blocks? In Chapter 3 (page 14) of the CUDA Programming Guide Version 2.0,
When a multiprocessor is given one or more thread blocks to execute, it splits them
into warps that get scheduled by the SIMT unit. The way a block is split into warps
is always the same; each warp contains threads of consecutive, increasing thread IDs
with the first warp containing thread 0. Section 2.1 describes how thread IDs relate
to thread indices in the block.
This indicates that a warp is only made up of threads WITHIN a block. However, our performance numbers above indicate otherwise. Any ideas as to what is really going on?
Still, the numbers that we see are much larger than expected. Does the register latency account for so much of overhead? Has anyone else seen this behavior?
For capability 1.3 devices, each multiprocessor can have up to 32 active warps. If you are underutilizing memory bandwidth, but stalled on memory latency, you can get higher throughput by having each multiprocessor time-slice between multiple active warps. Bandwidth can give you one float every 20 clocks (or on that order of magnitude) whereas latency has you waiting more than ten times that long.
Register latency can also be hidden by higher occupancy, but it is smaller to begin with, and sometimes in a single thread the instructions can be scheduled so as to not hit a latency problem.
I would say, IF you are compute-bound, and IF you do not have stalls from register or memory dependencies, then you should expect a speedup of 30.
Block size of 1 and block size of 32 - have no difference. Both will take same time (assuming the workload is 32 times higher for a WARP compared to a single thread)
But for B and A it’s more tricky. Each MP can have up to 8 active blocks, so for B you would have 8*200 = 1600 active threads, but your device can only handle 1024 active threads. Maybe that’s why it isn’t twice as fast as C…
Same goes for A
-Nico
PS: OOPS, I noticed I was confusing grid size and block size :-)
On the other hand, if your kernel is heavily bandwith limited, each new block run in parallel on the same SM contributes to the speedup because the scheduler has a way to hide the latency.
In ideal scenario, without counting scheduler’s overhead and other things: 32 warps on 30 SM could give you speedup of 960 as compared to <<<1,32>>> configuration and even more compared to <<<1,1>>>.
I don’t see how the block size of 1 and 32 can be equivalent. With a block size of 1, only 1 core (out of 8 cores on a SM) will be used. With a block size of 32, all 8 cores are likely to be used.
With a kernel <<1,1>>, I only have 1 thread, and 1 block. Hence the entire computation is executed sequentially on a single core of the device.
CUDA devices are not traditional multicore CPUs with independent schedulers for each SP (“core”). A warp (or perhaps a half-warp in current chips, but we are told to think of the full warp) is scheduled as a unit, and the 32 threads are folded up into 8 groups, 4 threads deep and pipelined into the SPs on the multiprocessor executing the warp. Since all the threads in the warp are running the same instruction (or are masked out), there only needs to be one instruction decoder per multiprocessor. If you have an unfilled warp, as you would with a blocksize of 1, the unused threads appear as bubbles in the pipeline, causing the SPs to be doing nothing for several clock cycles.
6 warps are needed to guarantee that register latency is hidden. If dependent instructions are spaced far enough apart, there is no stall due to instruction latency. In theory, 1 warp can fully utilize a single SM. But doing so is nearly impossible for most real applications.
I’m using a very simple kernel - addition of 2 long arrays. They’re in global memory, so perhaps the latency is really an issue.
In response to plmae’s comments, I do see the value of register latency hiding (I hadn’t realized that it takes 6 warps to hide the memory latency). This would mean that if I used <<<480,1>>>, I should still see a speed up of 180x. Let me try this.
Once we increase the number of grids beyond <<<240, 1>>>, the speedup stays saturated at 180x, even all the way to <<<3000, 1>>>. So, 6 warps are coming into play, to hide the latency.