Wisdom Around Optimal Number of Blocks in a Grid?

I’m playing with a simple program to add 100,000,000 floating point numbers. I started out on the GT 120 and have now moved to the GTX 285.

Initially with the GT 120, which has 4 processors and 32 cores, 2 or 4 blocks seemed optimal, but I didn’t test beyond 4.

After I started working on the GTX 285, which has 30 processors and 240 cores, it looks like 20 blocks is optimal (I expected 15 or 30 to be).

Re-testing the GT 120, it turns out 20 blocks is optimal on that card as well, and it significantly outperforms 2 or 4 blocks.

So the question is: if you are writing code that can execute on different cards, how do you determine the optimal number of blocks to use? Can anyone explain why 20 would be optimal on a GTX 285?

I’m working on a Mac, but I was looking for a general answer to the question.

Thanks,
/Chris
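
(For reference, since the original source isn’t shown in this post: the kind of kernel under discussion can be sketched as a grid-stride loop plus a per-block shared-memory reduction, so the block count can be varied freely without touching the kernel. The kernel name and launch parameters below are made up, not taken from the attached code.)

```cpp
// Sketch of a block-count-agnostic sum: each block reduces a grid-stride
// slice into shared memory and writes one partial sum.
__global__ void partialSums(const float *in, float *blockSums, int n)
{
    extern __shared__ float cache[];            // blockDim.x floats

    float sum = 0.0f;
    // Grid-stride loop: correctness does not depend on gridDim.x.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = cache[0];
}

// Launch with whatever grid size is being tested, e.g.:
//   partialSums<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_blockSums, n);
// then add the numBlocks partial sums on the host (or with a second kernel).
```

With this structure the block count affects only performance, not correctness, which is what makes a sweep over block counts meaningful.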

Without seeing your code it’s impossible to say, but the short answer is that 20 is certainly not optimal in the general case, because:

  1. you have 10 SMs sitting idle, and
  2. one block per SM is usually not enough to hide memory latency effectively (see the sizing sketch just below).
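
As a rough illustration of that sizing advice (the blocksPerSM factor below is an assumed starting point to tune, not a measured value), the grid size can be derived from the device’s SM count at run time instead of being hard-coded:

```cpp
#include <cuda_runtime.h>

// Sketch: size the grid from the device's SM count so the same code adapts
// to a GT 120 (4 SMs) or a GTX 285 (30 SMs).
int pickNumBlocks(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    const int blocksPerSM = 4;                      // assumption, tune empirically
    return prop.multiProcessorCount * blocksPerSM;  // e.g. 30 * 4 = 120 on a GTX 285
}
```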

So on the GTX 285, with 30 processors, you’d expect the optimal number of blocks to be a multiple of 30?

Is there any way to determine if all of the processors are being used?

/Chris

If you’re using fewer than 30 blocks you’re definitely not using all the SMs, no.

It’s almost 50% slower if I use 30 blocks, so it sounds like my card is not using all 30 SMs. :-(

/Chris

And again, you haven’t posted any code, so there’s no way to tell whether you’re making a dumb mistake somewhere.

Source code is posted below in one of the replies as an attachment. I removed the inline version to make this easier to read.

/Chris

Is there some way to profile and see how many SMs are being used, and by which blocks?

/Chris
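
(One low-tech way to check this, sketched below rather than taken from any profiler documentation: have each block record the SM it ran on by reading the %smid special register via inline PTX, then inspect the mapping on the host.)

```cpp
// Sketch: record which SM each block ran on by reading the %smid register.
// (%smid reports the SM a thread is currently executing on; for a short
// kernel this gives a usable block-to-SM map.)
__global__ void whichSM(int *blockToSM)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        blockToSM[blockIdx.x] = (int)smid;
    }
}

// Host side: launch with the block count under test, copy blockToSM back,
// and count the distinct SM ids to see how many multiprocessors were used.
```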

You can use the occupancy calculator to get the number of blocks that can be resident per SM. But if you launch 20 blocks on a GPU with 30 SMs, your application will only ever use 20 SMs and will suffer from memory latency!
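
For completeness, newer toolkits also expose the occupancy calculation programmatically; a sketch (the kernel name and body are placeholders, not the poster’s code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real sum kernel; occupancy depends on the
// compiled kernel's register and shared-memory usage.
__global__ void sumKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Ask the runtime how many blocks of a given size can be resident per SM,
// then scale by the SM count. (This API is in newer toolkits; older setups
// would use the spreadsheet-based occupancy calculator instead.)
int occupancyBasedGrid(int blockSize)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, sumKernel,
                                                  blockSize, 0 /* dyn. smem */);

    printf("%d resident blocks/SM x %d SMs = %d blocks minimum\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return blocksPerSM * prop.multiProcessorCount;
}
```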

That’s the issue: I have 30 SMs, but I get almost twice the performance if I use only 20 blocks, so something seems wrong. It almost seems as if the card can’t use the last 10 SMs for some reason.

Thanks,

/Chris

I’ve stripped down the original test program so it doesn’t use any command-line options, runs each block-size test 100 times, and measures just the single-precision floating-point sum kernels.

Would someone be willing to download this, build and run it on a PC, and send me the resulting avg.csv output file?

/Chris

avg2.zip (16.2 KB)

I’ve changed the attachment in the previous post (avg2.zip). Would someone please run it for me on a PC with a 30-SM NVIDIA card? :-)

Thanks,

/Chris