Hi all,
Please know that I googled this question extensively first and read several older threads on this forum. But I'm not sure what the current answer is, since the architecture has changed so dramatically over the years, especially with the advent of Fermi.
I used to have an old GeForce 8600 GT with 4 SMs (yeah, ouch!). Back then, about two years ago, I was playing around with CUDA a bit and trying to learn the basics. Someone said at the time that GPUs reach peak efficiency at around 20,000 threads, meaning: try not to launch more than that. If you have more "stuff" to do in your code, write for loops so those 20,000 threads keep iterating and keep working.
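Just so it's clear what I mean, here's roughly how I understood that advice (my own sketch from memory, so the kernel name and the scaling operation are made up):

```
// My understanding of the "fewer threads + for loop" advice:
// launch a fixed-size grid and let each thread stride over the data.
__global__ void scale(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;  // total threads in the whole grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;                  // each thread processes many elements
}
```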
Fast forward two years to now.
Reading the CUDA by Example book and the programming guide, I’m seeing examples of launching millions of threads.
For example, I now have a Fermi-based GTX 460, which has 336 cores. Assuming I want every core busy, my logic says I need 336 blocks. And I read that 256 threads per block is ideal (err… at least it was two years ago). That works out to 336 × 256 = 86,016 threads, i.e. more than 80 thousand.
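In code, I'm picturing a launch like this (just to show my arithmetic; scale is the made-up kernel from my sketch above, and d_data/n would come from earlier setup):

```
dim3 grid(336);    // one block per core, by my (possibly wrong) logic
dim3 block(256);   // the block size I read was "ideal"
// 336 blocks * 256 threads = 86,016 threads total
// d_data: device pointer allocated with cudaMalloc elsewhere
scale<<<grid, block>>>(d_data, n);
```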
So with the advent of Fermi, what is best here? My guess is that I want to use each and every core. My naive algorithm used to launch upwards of 1 million threads (or a LOT more). Two years ago, someone suggested I launch a much smaller number, closer to 10,000 or so, and have those threads iterate in for loops.
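For reference, my naive version looked roughly like this (again a from-memory sketch, one thread per element with no loop; scale_naive and d_data are just stand-in names):

```
// One-thread-per-element style, like the book examples:
__global__ void scale_naive(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard, since the grid is rounded up past n
        data[i] *= 2.0f;
}

// The grid grows with the data: ~1 million elements -> 4096 blocks.
int n = 1 << 20;
int threads = 256;
int blocks = (n + threads - 1) / threads;
scale_naive<<<blocks, threads>>>(d_data, n);
```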
But what if I only launch 10,000 threads on my GTX 460? At a block size of 256, that's only about 40 blocks, which by my one-block-per-core logic means only 40 of the 336 cores are being used. Clearly not taking advantage of this robust card.
Bottom line: I'm trying to find a current, updated resource that explains this in detail for beginners like me. If someone asked me to guess, I would say that, with my card, those 80-some thousand threads would be the MINIMUM I'd want to launch, thereby using all 336 cores while maintaining 256 threads per block.
Is there a section of the programming guide or some other (updated and current) resource that explains this?
Thanks in advance.