I’m playing with a simple program that adds 100,000,000 floating-point numbers. I started out on the GT 120 and have now moved to the GTX 285.
Initially, on the GT 120, which has 4 multiprocessors (32 cores), 2 or 4 blocks seemed optimal, though I didn’t test beyond 4.
After moving to the GTX 285, which has 30 multiprocessors (240 cores), it looks like 20 blocks is optimal (I expected 15 or 30 to be).
Re-testing the GT 120, it turns out 20 blocks is optimal on that card as well, significantly outperforming 2 or 4 blocks.
So the question is: if you are writing code that can execute on different cards, how do you determine the optimal number of blocks to use? Can anyone explain why 20 would be optimal on a GTX 285?
I’m working on a Mac, but I was looking for a general answer to the question.