I have to implement a simple algorithm and I wonder what the ideal number of threads per block is. I’ve read something like 16 or 32, but I’d like to be sure. The threads in a block do not share anything.
Thanks for your answer!
256 is optimal if you do not have any __syncthreads() in your program and do not use too many registers.
Perfect! Thank you!
The optimum is some multiple of 32 (32, 64, …) up to 512. Every kernel will behave differently due to occupancy and memory access patterns. Benchmark your kernel on all sizes (if you can choose an arbitrary size, that is) to find the fastest. It is worth it: some of my kernels change performance by more than 50% depending on the block size.
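A rough sketch of such a sweep, written as plain host-side C++ so it stays independent of any particular kernel. The names fastest_block_size and measure_ms are made up for illustration: measure_ms is assumed to be a callback that launches your kernel with the given block size and returns the elapsed time in milliseconds (for example, timed with CUDA events).

```cpp
#include <functional>
#include <limits>

// Tries every multiple of 32 from 32 up to max_block and returns the block
// size with the smallest measured time. On a tie the smaller size wins.
// measure_ms is a hypothetical callback supplied by the caller: it should
// run the kernel with the given block size and return the elapsed time.
int fastest_block_size(const std::function<double(int)>& measure_ms,
                       int max_block = 512) {
    int best_size = 0;
    double best_ms = std::numeric_limits<double>::infinity();
    for (int block_size = 32; block_size <= max_block; block_size += 32) {
        const double ms = measure_ms(block_size);
        if (ms < best_ms) {
            best_ms = ms;
            best_size = block_size;
        }
    }
    return best_size;
}
```

In practice you would probably want measure_ms to do a warm-up launch and average over several runs, since single timings are noisy.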
When there is no interaction between threads in a block I would be surprised to see 512 threads perform better than 256 threads per block.
With 8 registers or fewer per thread, 512 is likely to give better performance, IMO.
But with 512 threads you will only have 1 block per multiprocessor for a total of 512 threads per MP. With 256 threads you will have 3 blocks per multiprocessor for a total of 768 threads per MP. So you should get better latency hiding (and you will likely be bandwidth-bound with a kernel as simple as that).
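The blocks-per-multiprocessor arithmetic above can be sketched in a few lines of host-side C++ (no GPU needed). This is only a back-of-the-envelope estimate: the per-MP limits (768 threads, 8 blocks, 8192 registers) are assumptions for the hardware generation discussed here, so check them against your own device, and shared memory and register allocation granularity are ignored. MpLimits, resident_blocks and resident_threads are names invented for this sketch.

```cpp
#include <algorithm>

// Assumed per-multiprocessor limits for the hardware discussed in this
// thread. These are illustrative defaults, not queried from a device.
struct MpLimits {
    int max_threads = 768;   // resident threads per multiprocessor
    int max_blocks = 8;      // resident blocks per multiprocessor
    int registers = 8192;    // 32-bit registers per multiprocessor
};

// Estimated number of blocks that can be resident on one multiprocessor.
// Ignores shared memory usage and register allocation granularity.
int resident_blocks(int threads_per_block, int regs_per_thread,
                    const MpLimits& mp = MpLimits{}) {
    const int by_threads = mp.max_threads / threads_per_block;
    const int by_registers = mp.registers / (threads_per_block * regs_per_thread);
    return std::min({by_threads, by_registers, mp.max_blocks});
}

// Estimated resident threads per multiprocessor for a given block size.
int resident_threads(int threads_per_block, int regs_per_thread,
                     const MpLimits& mp = MpLimits{}) {
    return resident_blocks(threads_per_block, regs_per_thread, mp) * threads_per_block;
}
```

Under these assumptions and a hypothetical 10 registers per thread, resident_blocks(256, 10) comes out as 3 blocks (768 threads) and resident_blocks(512, 10) as 1 block (512 threads), matching the numbers in the post above. Very small blocks run into the 8-block cap instead: 64 threads per block also ends up at only 512 resident threads.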
Agreed, but it is definitely worth trying larger blocks for kernels with tiny register usage.
Measuring = knowing, that is for sure. I was just wondering if you had ever encountered a case like that. For kernels that use shared memory I can easily see the possibility; for kernels that don’t, it would surprise me a lot (and mean that I understand less than I hoped ;))
In general, 256 works best for me. In some cases 192 was somewhat faster, but I’ve never seen 512 threads give an improvement above 256.