I’m playing with a simple program to add 100,000,000 floating point numbers. I started out on the GT 120 and have now moved to the GTX 285.
Initially, on the GT 120, which has 4 multiprocessors (32 cores), 2 or 4 blocks seemed optimal, but I didn’t test beyond 4.
After I started working on the GTX 285, which has 30 multiprocessors (240 cores), 20 blocks looks optimal (I expected 15 or 30 to be).
Re-testing the GT 120, it turns out 20 blocks is optimal on that card as well, significantly outperforming 2 or 4 blocks.
So the question is: if you are writing code that can execute on different cards, how do you determine the optimal number of blocks to use? Can anyone explain why 20 would be optimal on a GTX 285?
I’m working on a Mac, but I was looking for a general answer to the question.
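For context, the kind of kernel I’m timing looks roughly like the sketch below (names are mine, not the actual test program). The point is that with a grid-stride loop the block count is a free tuning parameter: any number of blocks produces the correct partial sums, so I can sweep it and just measure the timing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sum-reduction kernel: each block produces one partial sum.
// Correctness does not depend on gridDim.x, so the block count can be
// swept freely when benchmarking.
__global__ void sumKernel(const float *in, int n, float *blockSums)
{
    extern __shared__ float sdata[];

    // Grid-stride loop: each thread accumulates a strided slice of the input.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];

    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = sdata[0];
}
```

The host then sums the `blockSums` array (or runs a second reduction pass), so the final answer is identical whether I launch 2 blocks or 200.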
You can use the occupancy calculator to find how many blocks fit per SM. But if you launch only 20 blocks on a GPU with 30 SMs, your application will use just 20 of them, and with no spare blocks to schedule it will suffer from memory latency!
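A portable way to avoid hard-coding the block count is to query the device at runtime and scale the grid to the SM count. This is a sketch, assuming a multiplier of 2 blocks per SM as a starting point for tuning; the right factor depends on the kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Common heuristic: launch a small multiple of the SM count so every
    // SM has work, plus extra blocks to help hide memory latency.
    const int blocksPerSM = 2;  // assumed tuning factor, not a fixed rule
    int numBlocks = prop.multiProcessorCount * blocksPerSM;

    printf("%s: %d SMs -> launching %d blocks\n",
           prop.name, prop.multiProcessorCount, numBlocks);
    return 0;
}
```

On a GTX 285 (30 SMs) this would launch 60 blocks; on a GT 120 (4 SMs) it would launch 8, so the same binary adapts to either card.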
That’s the issue: I have 30 SMs, but I get almost twice the performance when I only use 20 blocks, so something seems wrong. It’s almost as if it can’t use the last 10 SMs for some reason.
I’ve stripped the original test program down so that it takes no command-line options, runs each block-size test 100 times, and measures just the floating-point sum kernels.
Would someone be willing to download this, build and run it on a PC, and send me the resulting avg.csv output file?