The answer is in Appendix A of the programming guide: the magic number is 512 threads per block. Be aware that there are also register and shared memory limits, which can prevent even a block of 512 or fewer threads from launching.
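For reference, those limits can be queried at runtime. A minimal sketch (my addition, not from the original post) that prints the relevant cudaDeviceProp fields:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);           // 512 on compute 1.x
    printf("regsPerBlock:       %d\n", prop.regsPerBlock);                 // register file per SM
    printf("sharedMemPerBlock:  %u\n", (unsigned)prop.sharedMemPerBlock); // 16 KB on compute 1.x
    return 0;
}

A launch that exceeds the register or shared memory budget fails with cudaErrorLaunchOutOfResources, even at 512 or fewer threads per block.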
In my experiments with a GTX 285 (30 SMs) and a GT 120 (4 SMs), 512 was consistently the optimal number of threads per block.
I was also trying to find the optimal number of blocks, which has a dramatic impact on the performance of my test application. A multiple of the number of SMs seemed like it would be ideal, but on both of these devices 20 blocks appears to be optimal.
Attached is the source to a program that adds 1,000,000 floating-point numbers and tries various block configurations. I've not had much luck getting anyone on a non-Mac platform to run it for me, but I'd love to find out whether other 30-SM cards exhibit the same performance characteristics.
There is a spreadsheet in the file that shows the performance at block counts from 4 to 200. Below 4 blocks it was far slower, so I didn't include those points in the graph.
Each block executes on a single SM, so if you have 30 SMs and only 20 blocks, 10 SMs sit idle. On top of that you never have more than one block per SM, which is bad for hiding latency.
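For example, a sketch of sizing the grid from the SM count (my illustration; blocksPerSM is just a heuristic knob, and sum_kernel and its arguments are placeholder names, not from the attached code):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocksPerSM = 2;  // at least 2 resident blocks per SM helps hide latency
int numBlocks = prop.multiProcessorCount * blocksPerSM;  // 60 on a 30-SM GTX 285
sum_kernel<<<numBlocks, 256>>>(d_data, d_partial, num);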
cnicotra, I must say your code and your results look weird. I don't fully understand what happens there, but you seem to be adding numbers in loops in a serial fashion. You actually penalize smaller blocks by doing:
n = num / blockDim.x; //n is higher if blocks are smaller
and then
n2 = n >> 2;              // unroll by 4: n2 iterations of 4 loads each
for (i = 0; i < n2; i++)
{
    sum += *d_data;       // four dependent, strided global loads
    d_data += blockDim.x; // per iteration; stride = blockDim.x elements
    sum += *d_data;
    d_data += blockDim.x;
    sum += *d_data;
    d_data += blockDim.x;
    sum += *d_data;
    d_data += blockDim.x;
}
This causes loads of serial global memory accesses. Evil.
I.e. the smaller the blocks you use, the longer the loops get and the more work each thread has, and all of that work is serial.
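The usual fix is a grid-stride loop, where the whole grid, not a single block, covers the array, so adding blocks shortens each thread's loop instead of lengthening it. A rough sketch (placeholder names, not the attached code):

__global__ void sum_kernel(const float *d_data, float *d_partial, int num)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    float sum  = 0.0f;
    // Adjacent threads read adjacent addresses (coalesced), and the
    // per-thread loop shrinks as the grid grows.
    for (int i = idx; i < num; i += stride)
        sum += d_data[i];
    // ... combine the per-thread sums within the block as before ...
}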
It’s definitely NOT optimal to have 20 blocks and 512 threads per block all the time; IIRC you’ve posted this in a different thread before. If you have fewer blocks than multiprocessors, as you do on a GTX 285, some MPs are idling! If your app really gets slower with other configurations, that’s a good indication you haven’t understood how CUDA parallelism works and have structured it the wrong way.
512 threads per block can be wasteful on compute 1.0 and 1.1 cards, which can run up to 768 threads per MP. One block of 512 threads occupying an MP leaves the remaining 256 thread slots unused, which means low occupancy (not necessarily low performance, but still). Remember, blocks can’t be split between MPs: multiple blocks can be resident on one MP, but a single block cannot be distributed across several MPs.
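As a worked example of that arithmetic (my addition; both counts are under the 8-blocks-per-MP residency cap of compute 1.x):

// Compute 1.0/1.1: up to 768 threads resident per MP.
int blocks512 = 768 / 512;  // = 1 resident block  -> 512 active threads (67% occupancy)
int blocks256 = 768 / 256;  // = 3 resident blocks -> 768 active threads (100% occupancy)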
By the way, those are two different questions.
Besides, there’s no overhead in having more threads than the GPU has processors. In fact, it’s recommended. Use as many as you like; the more the better (in most cases).
Since CUDA has a limit of 512 threads per block, I assume that means there is no way to use the hardware’s capability of running up to 768 threads per MP?
Interesting that the ratio between 768 and 512 is 1.5, so maybe that explains why 20 is the optimal number of blocks (20 × 1.5 = 30 SMs)?
Unfortunately, I don’t understand the comment about the main loop causing serialization. What would be the alternative? I tried approaches like folding in a binary fashion, as I did to add up the results from the threads, but after tweaking the code to use the partially unrolled loop above, the unrolled version ended up being much faster.
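For what it’s worth, the standard pattern combines both ideas: a coalesced grid-stride accumulation phase followed by a binary fold in shared memory. A minimal sketch (placeholder names; assumes blockDim.x is a power of two):

__global__ void reduce_kernel(const float *d_data, float *d_block_sums, int num)
{
    extern __shared__ float s[];

    // Phase 1: each thread accumulates a coalesced, grid-strided slice.
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum  = 0.0f;
    for (int i = idx; i < num; i += stride)
        sum += d_data[i];
    s[threadIdx.x] = sum;
    __syncthreads();

    // Phase 2: binary fold in shared memory; log2(blockDim.x) steps.
    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (threadIdx.x < half)
            s[threadIdx.x] += s[threadIdx.x + half];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = s[0];  // one partial sum per block
}

// launch: reduce_kernel<<<numBlocks, 256, 256 * sizeof(float)>>>(d_data, d_block_sums, num);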
Interestingly, 50 blocks, which means even smaller per-block workloads, performs almost as well as 20.
I really appreciate your help and feedback. Thanks!!! :-)
Yes, there is. For example, if you use 256 threads per block and have enough blocks, an SM can have 3 active blocks simultaneously, resulting in 768 concurrently active threads.
No. There’s no way a block is split between two SMs, so an SM can’t process 1.5 blocks.
I bet you do :) But that’s because the code you wrote is not suited for the speed test you’re trying to perform.
If you run your program through the visual profiler, you will see that the number of uncoalesced loads/stores as well as the number of divergent branches are dependent on the grid size.
In my ignorance, I think I’ve dealt with the uncoalesced issue. Originally I had each thread working on a different chunk of memory; when I moved to interleaving them, things got an order of magnitude faster.
I have the project loaded in the visual profiler, but I can’t figure out how to view the uncoalesced loads/stores.
A maximum of 1024 threads (768 on older GPUs) can be scheduled per multiprocessor at any given time; that’s over 30,000 threads (30 × 1024 = 30,720) on a Tesla 10/GTX 280. You should treat this number as the MINIMUM number of threads to aim for, as that only just covers all the pipelines.
The reason the number of blocks had such a huge impact on performance, and the times swung so wildly, turned out to be the alignment of the data each block started with. By forcing each block to start reading from a 64-byte boundary, performance became much more consistent across different numbers of blocks. 20 blocks is still optimal for this case, but 30 blocks is very close. At low block counts above 10 there is some minor variation in performance, and as the number of blocks increases above 90, performance becomes fairly constant.
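A host-side sketch of that alignment trick (my reconstruction, not the actual attached code):

// Round each block's chunk up to a multiple of 16 floats (64 bytes), so
// every block's first load starts on a 64-byte line; pointers returned by
// cudaMalloc are themselves at least 64-byte aligned.
int chunk        = (num + numBlocks - 1) / numBlocks;  // elements per block
int alignedChunk = (chunk + 15) & ~15;                 // round up to 16 floats
sum_kernel<<<numBlocks, threadsPerBlock>>>(d_data, d_partial, num, alignedChunk);
// Inside the kernel, block b reads from d_data + b * alignedChunk and
// clamps its element count so the last block doesn't run past num.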
The Blocks.gif image is an Excel plot of the performance at various block counts.