I’m trying to examine the relationship between performance and the number of blocks. What I do here is increase the block count from 1 per multiprocessor to 8 per multiprocessor. Performance suddenly drops when I use 90 blocks, i.e. 6 blocks per multiprocessor on average, given that the GTX480 has 15 multiprocessors. I don’t know how to explain this…
(It is the same kernel throughout, using both thread-level parallelism and instruction-level parallelism; it uses 27 registers per thread.)
I’m also trying different block sizes. The big drop at 90 blocks is obvious when the block size is small (i.e. 32, 64, 96, 128, 160, 192, or 224), and the curves are relatively flat once the block size grows beyond that range.
It is probably just an occupancy phenomenon. With fewer than a critical number of blocks, the execution time is pretty close to the execution time of one block. Once every MP reaches full occupancy, increasing the block count causes the execution time to jump to close to that of two blocks, because at least one block must wait until an MP has resources free to launch it.
To be specific, the GTX480 has 15 MPs, each of which has 32 cores. What I’m doing is calling the kernel with 15 blocks first, assuming each block is simply assigned to one of the MPs, so that the 15 blocks run concurrently on all 15 MPs. After that, I increase the block count to 30, 60, 90, and 120 so as to hide the memory access latency and, at the same time, find the threshold where the throughput saturates.
Basically, I expect that as the overall block count increases, the number of blocks assigned to each MP increases accordingly (I assume evenly). In this range, the throughput should go up linearly, because the pipelining becomes progressively more effective at hiding the latency. However, once all the latency cycles are hidden by a certain number of blocks on each MP, the throughput should saturate, meaning further increases in the block count gain no further acceleration, and the throughput curve, I guess, should be relatively flat. My confusion is that I never thought there would be a huge drop within that flat range. What I observe from the data is that the throughput goes up linearly at first, and just as it is about to flatten out, it dips into a concave notch when I launch 90 blocks (6 blocks/MP) or 120 blocks overall, then comes back up to the flat threshold at 150 blocks.