Hello,
I'm a Japanese student.
I have a question about "threads" and "blocks".
How should I choose the number of threads and blocks to get the best performance?
My GPU is a GeForce GTX 650.
Sorry, I'm not so good at English.
There is no single rule for the best block size - it depends on your CUDA kernels: their register usage, their shared memory usage, and how many threads are needed for good occupancy.
I would suggest you review some of the introductory CUDA webinars recorded at GTC, and also the CUDACasts on YouTube.
There is also a good explanation in the CUDA Programming Guide on www.docs.nvidia.com.
Finally, once you understand the factors, there is the CUDA Occupancy Calculator (referenced in the manual), which will help you see the impact of each of these elements for your situation.
Good Luck
I recommend reading the book CUDA by Example. It is a little old, but it will give you a good idea of how to write CUDA programs. The CUDA Programming Guide is also a very good document and not too hard to read.
I'm a bit confused as well, but does the exact block size matter much for calculations with millions of elements? Whether you use 128, 256, 512, or 1024 threads per block, isn't the important thing simply that you keep the maximum number of threads per SM occupied?
An SM can run more than one block, and it can have at most 1536 active threads for cc 2.0 and 2048 for cc 3.5. In the case of a Fermi card (cc 2.0), with 768 threads per block you could have 2 blocks executing, for a total of 1536 threads per SM, while with 1024 threads per block only one block fits, so only 1024 threads are active on the SM and the practical occupancy is lower.
The optimal number of threads per block very much depends on the kernel you are running. In the kernels I am developing, a large number of threads is often not optimal because of high register count or shared memory usage. In my case, 64 or 128 threads per block is often a good value for Fermi GPUs. I don't have any Kepler GPUs at my disposal, so this might change with different hardware.