Hi there.
I’m working with several CUDA machines: a GTX 260 at home, and a 9800 GT plus 4x Tesla C1060 at work. I’ve read the NVIDIA CUDA Programming Guide all the way through.
I’ve written a simple test kernel: it computes around a million sines without any accesses to global or shared memory. In other words, the kernel is pure math. It uses 1 register per thread. I launch it with different launch configurations, measure its running time (using cudaStreamSynchronize(0); to wait for the kernel, of course) and compute the performance (number of sines per second).
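For reference, here is a minimal sketch of the kind of kernel and timing I mean (identifiers and the iteration count are made up, not my exact code, and I show cudaEvent-based timing here instead of my cudaStreamSynchronize-based timing):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pure-math kernel: each thread iterates sinf() with no global/shared
// memory traffic inside the loop. The single store at the end exists
// only so the compiler cannot optimize the whole loop away.
__global__ void sineKernel(float *out, int iters)
{
    float x = 0.5f + threadIdx.x;   // per-thread seed, stays in a register
    for (int i = 0; i < iters; ++i)
        x = sinf(x);
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    // 14 * 256 * 256 = 917504, i.e. around a million sines total.
    const int blocks = 14, threads = 256, iters = 256;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    sineKernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double sines = (double)blocks * threads * iters;
    printf("%.3f ms, %.3e sines/s\n", ms, sines / (ms * 1e-3));

    cudaFree(d_out);
    return 0;
}
```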
Things are clear with blocks: one or more blocks (up to 8) can reside on a single MP, while one block can’t be split across two MPs. So peak performance is reached when the number of blocks is a multiple of the number of MPs. The graph of “performance vs. block count” is a “saw” with a “tooth” period equal to the MP count (14 for my 9800 GT):
This is clear: the GPU processes up to 14 blocks at once, so, for example, with 21 blocks, half the MPs sit idle during the second round of block processing, reducing performance.
What I still can’t understand is how threads are distributed and processed inside a multiprocessor. I ran the same test, varying the number of threads per block with 14 and 28 blocks total, and this is what I see:
(for 14 blocks)
(for 28 blocks)
I see that 32*8 = 256 threads is the hardware limit on the number of threads processed at once, since that is where performance peaks. But the official Programming Guide says this limit should be 768 resident threads, which is 3 times more (the figure for Compute Capability 1.0/1.1, which my 9800 GT is).
On the one hand, this suggests that each MP core processes a whole warp at once, so an MP processes 8 warps at once; but then 9 warps should take 2 rounds, just like block processing, whereas they actually take about 1.125 rounds.
On the other hand, these graphs also show a “saw” with a period of 32 threads, i.e. one warp. I can hardly make sense of all this on a Friday evening :)
So, if anyone knows, please explain:
- How many threads can a single MP physically process at once?
- What does “Maximum number of resident warps per multiprocessor” in the Programming Guide mean?
- How many threads does a single MP core actually process at once?
- How is a single block actually split between the multiprocessor cores?
Thanks.