We ran an experiment to evaluate the scalability of our CUDA program. We're using a Quadro FX 5800, which has 240 cores.
First, run a kernel with grid size 1 and block size 1; this should execute on a single core of the GPU.
Then repeat the experiment with different grid and block sizes. With sufficiently large grid and block sizes, we would expect a maximum speed-up of 240 (since there are only 240 cores).
In particular, if the block size stays at 1 and only the grid size is increased, each SM should use only 1 of its 8 cores at a time, so the maximum achievable speed-up would be 30 (since there are 30 SMs).
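For reference, here is a minimal sketch of the kind of timing harness behind these numbers. busyKernel is a hypothetical stand-in for our real kernel (we're not posting the actual code), and the iteration count is a placeholder; the total work is fixed and split evenly across blocks, so the speed-up is just the ratio of wall times:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-bound stand-in for our real kernel: the total
// amount of arithmetic is fixed, and it is split evenly across blocks.
__global__ void busyKernel(float *out, int itersPerThread) {
    float x = (float)blockIdx.x;
    for (int i = 0; i < itersPerThread; ++i)
        x = x * 1.0000001f + 0.5f;       // dependent FLOPs, not optimized away
    out[blockIdx.x] = x;                 // block size is 1: one value per block
}

int main() {
    const int totalIters = 1 << 26;      // placeholder: fixed total work
    const int grids[] = {1, 10, 100, 200, 240};
    float *d_out;
    cudaMalloc((void**)&d_out, 240 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float t1 = 0.0f;                     // wall time of the <<<1,1>>> baseline
    for (int i = 0; i < 5; ++i) {
        const int g = grids[i];
        cudaEventRecord(start);
        busyKernel<<<g, 1>>>(d_out, totalIters / g);   // block size stays 1
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (g == 1) t1 = ms;
        printf("<<<%3d,1>>>  %8.3f ms  speed-up %5.1fx\n", g, ms, t1 / ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}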
However, we’re seeing that:
a) A kernel <<<240,1>>> runs 180 times faster than the kernel <<<1,1>>>.
b) A kernel <<<200,1>>> runs 157 times faster than the kernel <<<1,1>>>.
c) A kernel <<<100,1>>> runs 90 times faster than the kernel <<<1,1>>>.
d) A kernel <<<10,1>>> runs 9 times faster than the kernel <<<1,1>>>.
These speed-ups far exceed the 30x ceiling predicted above, so the question really is:
When an SM executes a half-warp, can this half-warp consist of threads from multiple blocks? Chapter 3 (page 14) of the CUDA Programming Guide Version 2.0 says:
When a multiprocessor is given one or more thread blocks to execute, it splits them
into warps that get scheduled by the SIMT unit. The way a block is split into warps
is always the same; each warp contains threads of consecutive, increasing thread IDs
with the first warp containing thread 0. Section 2.1 describes how thread IDs relate
to thread indices in the block.
This indicates that a warp is made up only of threads WITHIN a block. However, our performance numbers above suggest otherwise. Any ideas as to what is really going on?
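For concreteness, here is a small sketch of how we read the quoted rule. This is our own illustration, not code from the guide; recordWarp and the launch configuration are ours:

#include <cstdio>
#include <cuda_runtime.h>

// Records, for every thread, which warp of its OWN block it belongs to.
// Per the quoted rule, the warp index depends only on the thread's ID
// inside its block, never on threads from other blocks.
__global__ void recordWarp(int *warpIdx) {
    int tid = threadIdx.x;                                    // 1-D block assumed
    warpIdx[blockIdx.x * blockDim.x + tid] = tid / warpSize;  // warpSize == 32
}

int main() {
    const int blocks = 4, threads = 1;   // mirrors a <<<4,1>>> launch
    int h[blocks * threads], *d;
    cudaMalloc((void**)&d, sizeof(h));
    recordWarp<<<blocks, threads>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %d: thread 0 is in warp %d of its block\n", b, h[b]);
    cudaFree(d);
    return 0;
}

With <<<4,1>>> every block's lone thread reports warp 0 of its own block, i.e. each single-thread block occupies its own, almost empty warp. Under that reading, a block-size-1 launch should be capped near the 30x ceiling described above, yet we measure up to 180x.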