Optimum thread dimensions for kernel launch? Algorithms for optimal thread dimensions

I haven’t seen any discussions on optimal thread dimensions for launching kernels (although I am sure to be corrected on this point!), so as a start I include some code I have been working on. The problem splits into two types, depending on whether the required number of threads factorises cleanly. If it does, the threads can be laid out exactly. If it is prime (or, more strictly, has no factors that fit within the maximum thread size for x or y), the problem becomes one of minimising the number of extra duff threads created (hopefully less than a warp’s worth).
The attached code creates optimum grids for the first group, and makes an attempt on the second group.
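To make the idea concrete, here is a minimal host-side sketch. It is not the attached optimise.cu, only a simplified illustration under stated assumptions: a 1D launch, and default limits of 32, 1024 and 65535 that I picked for the example (real limits should come from `cudaGetDeviceProperties`). The names `LaunchShape` and `chooseLaunchShape` are made up for this post. It brute-forces every block size in range and keeps the one with the fewest surplus threads, preferring the larger block on a tie.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: a 1D launch shape plus the surplus thread count.
struct LaunchShape {
    std::uint64_t block;   // threads per block
    std::uint64_t grid;    // number of blocks
    std::uint64_t excess;  // grid * block - n, i.e. the "duff" threads
};

// Brute-force search over block sizes in [minBlock, maxBlock].
// The default limits are assumptions for illustration, not device facts.
inline LaunchShape chooseLaunchShape(std::uint64_t n,
                                     std::uint64_t minBlock = 32,
                                     std::uint64_t maxBlock = 1024,
                                     std::uint64_t maxGrid = 65535) {
    assert(n > 0 && minBlock > 0 && minBlock <= maxBlock);
    LaunchShape best{0, 0, 0};
    for (std::uint64_t b = minBlock; b <= maxBlock; ++b) {
        std::uint64_t g = (n + b - 1) / b;  // ceiling division
        if (g > maxGrid) continue;          // too many blocks for a 1D grid
        std::uint64_t excess = g * b - n;
        // "<=" so that a later (larger) block size wins a tie on excess.
        if (best.block == 0 || excess <= best.excess) {
            best = {b, g, excess};
        }
    }
    return best;  // block == 0 means nothing fits within the limits
}
```

With these defaults, `chooseLaunchShape(1000000)` finds an exact fit of 1000 blocks of 1000 threads, while the prime 1031 ends up as 2 blocks of 516 threads with one surplus thread. Any surplus threads still need the usual `if (idx < n)` guard inside the kernel.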

On testing (with up to 1000000 threads) it gets a perfect grid 34.8% of the time, and has a maximum excess thread count of 719.
Can anyone improve this?

optimise.cu (3.88 KB)

Your attachment doesn’t work, so forgive me in advance if this is a naïve question, but what is a "perfect" grid? In the absence of actual code to look at, is your approach any different to this, which popped up on Stack Overflow a couple of months ago?