I haven’t seen any discussions on optimal thread dimensions for launching kernels (although I am sure to be corrected on this point!), so as a start I include some code I have been working on. The problem splits into two cases, depending on whether the required number of threads can be factored exactly into the allowed dimensions. If it can, the grid can be sized with no wasted threads. If it cannot, because the count is prime (or, more strictly, has a prime factor larger than the max thread size for x or y), the problem becomes one of minimising the number of extra duff threads created (hopefully less than a warp's worth).
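To make the padding problem concrete, here is a minimal host-side sketch. It is not the contents of optimise.cu, only a brute-force illustration restricted to a 1D block and a 1D grid. The names `LaunchShape` and `pick_launch_shape` are hypothetical, and the limits (512 threads per block, 65535 blocks, a minimum block size of 32) are assumed defaults that should be replaced with the values reported for the actual device.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper type for this sketch; not part of the CUDA API.
struct LaunchShape {
    std::uint32_t block;   // threads per block (1D); 0 means "no shape found"
    std::uint32_t grid;    // number of blocks (1D)
    std::uint64_t excess;  // block * grid - n: threads that must sit idle
};

// Try every block size in [min_block, max_block] and keep the one that
// creates the fewest excess threads. Ties go to the larger block size.
// The default limits are assumptions; real values can be read from
// cudaGetDeviceProperties (maxThreadsPerBlock, maxGridSize).
LaunchShape pick_launch_shape(std::uint64_t n,
                              std::uint32_t max_block = 512,
                              std::uint32_t max_grid = 65535,
                              std::uint32_t min_block = 32) {
    // Precondition: a non-empty range of candidate block sizes.
    assert(min_block >= 1 && min_block <= max_block);

    LaunchShape best{0, 0, std::numeric_limits<std::uint64_t>::max()};
    if (n == 0) return best;  // nothing to launch

    for (std::uint32_t b = max_block; b >= min_block; --b) {
        std::uint64_t g = (n + b - 1) / b;  // blocks needed, rounded up
        if (g > max_grid) break;            // smaller blocks only need more
        std::uint64_t excess = g * b - n;
        if (excess < best.excess) {
            best = {b, static_cast<std::uint32_t>(g), excess};
        }
    }
    return best;
}
```

For example, `pick_launch_shape(1000)` settles on 2 blocks of 500 threads with no excess, while a prime count such as 1009 always leaves some idle threads under these limits. Inside the kernel, the extra threads still need the usual `if (idx < n)` guard. The sketch ignores whether the block size is a multiple of the warp size, which a real launch would probably also want to weigh.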
The attached code creates optimum grids for the first group, and makes an attempt at the second group.
On testing (with up to 1,000,000 threads) it finds a perfect grid, i.e. one with no excess threads, 34.8% of the time, and in the worst case creates 719 excess threads.
Can anyone improve this?
Simon
optimise.cu (3.88 KB)