Wisdom Around Optimal Number of Blocks in a Grid?

I’m playing with a simple program to add 100,000,000 floating point numbers. I started out on the GT 120 and have now moved to the GTX 285.

Initially with the GT 120, which has 4 processors and 32 cores, 2 or 4 blocks seemed optimal, but I didn’t test beyond 4.

After I started working on the GTX 285, which has 30 processors and 240 cores, it looks like 20 blocks is optimal (I expected 15 or 30 to be).

Re-testing the GT 120, it turns out 20 blocks is optimal on that card as well, and it significantly outperforms 2 or 4 blocks.

So the question is: if you are writing code that can execute on different cards, how do you determine the optimal number of blocks to use? Can anyone explain why 20 would be optimal on a GTX 285?

I’m working on a Mac, but I was looking for a general answer to the question.

Thanks,
/Chris
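
(For reference, since the original source isn’t shown in this post: the kind of kernel under discussion can be sketched as a grid-stride loop plus a per-block shared-memory reduction, so the block count can be varied freely without touching the kernel. The kernel name and launch parameters below are made up, not taken from the attached code.)

```cpp
// Sketch of a block-count-agnostic sum: each block reduces a grid-stride
// slice into shared memory and writes one partial sum.
__global__ void partialSums(const float *in, float *blockSums, int n)
{
    extern __shared__ float cache[];            // blockDim.x floats

    float sum = 0.0f;
    // Grid-stride loop: correctness does not depend on gridDim.x.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = cache[0];
}

// Launch with whatever grid size is being tested, e.g.:
//   partialSums<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_blockSums, n);
// then add the numBlocks partial sums on the host (or with a second kernel).
```

With this structure the block count affects only performance, not correctness, which is what makes a sweep over block counts meaningful.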

Without seeing your code it’s impossible to say, but the short answer is that 20 is certainly not optimal in the general case, because:

  1. you have 10 SMs sitting idle, and
  2. one block per SM is usually not enough to hide memory latency effectively (see the sizing sketch just below).
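
As a rough illustration of that sizing advice (the blocksPerSM factor below is an assumed starting point to tune, not a measured value), the grid size can be derived from the device’s SM count at run time instead of being hard-coded:

```cpp
#include <cuda_runtime.h>

// Sketch: size the grid from the device's SM count so the same code adapts
// to a GT 120 (4 SMs) or a GTX 285 (30 SMs).
int pickNumBlocks(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    const int blocksPerSM = 4;                      // assumption, tune empirically
    return prop.multiProcessorCount * blocksPerSM;  // e.g. 30 * 4 = 120 on a GTX 285
}
```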

So on the GTX 285, with 30 processors, you’d expect the optimal number of blocks to be a multiple of 30?

Is there any way to determine if all of the processors are being used?

/Chris

If you’re using fewer than 30 blocks you’re definitely not using all the SMs, no.

It’s almost 50% slower if I use 30 blocks, so it sounds like my card is not using all 30 SMs. :-(

/Chris

And again, you haven’t posted any code, so there’s no way to tell whether you’re making a dumb mistake somewhere.

Source code is posted below in one of the replies as an attachment. I removed the inline version to make this easier to read.

/Chris

Is there some way to profile and see how many SMs are being used, and by which blocks?

/Chris
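
(One low-tech way to check this, sketched below rather than taken from any profiler documentation: have each block record the SM it ran on by reading the %smid special register via inline PTX, then inspect the mapping on the host.)

```cpp
// Sketch: record which SM each block ran on by reading the %smid register.
// (%smid reports the SM a thread is currently executing on; for a short
// kernel this gives a usable block-to-SM map.)
__global__ void whichSM(int *blockToSM)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        blockToSM[blockIdx.x] = (int)smid;
    }
}

// Host side: launch with the block count under test, copy blockToSM back,
// and count the distinct SM ids to see how many multiprocessors were used.
```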

You can use the occupancy calculator to get the number of blocks that can be resident per SM. But if you launch 20 blocks on a GPU with 30 SMs, your application will only ever use 20 SMs and will suffer from memory latency!
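
For completeness, newer toolkits also expose the occupancy calculation programmatically; a sketch (the kernel name and body are placeholders, not the poster’s code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real sum kernel; occupancy depends on the
// compiled kernel's register and shared-memory usage.
__global__ void sumKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Ask the runtime how many blocks of a given size can be resident per SM,
// then scale by the SM count. (This API is in newer toolkits; older setups
// would use the spreadsheet-based occupancy calculator instead.)
int occupancyBasedGrid(int blockSize)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, sumKernel,
                                                  blockSize, 0 /* dyn. smem */);

    printf("%d resident blocks/SM x %d SMs = %d blocks minimum\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return blocksPerSM * prop.multiProcessorCount;
}
```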

That’s the issue: I have 30 SMs, but I get almost twice the performance if I use only 20 blocks, so something seems wrong. It almost seems as if the card can’t use the last 10 SMs for some reason.

Thanks,

/Chris

I’ve stripped down the original test program so it doesn’t use any command-line options, runs each block-size test 100 times, and measures just the single-precision floating-point sum kernels.

Would someone be willing to download this, build and run it on a PC, and send me the resulting avg.csv output file?

/Chris

avg2.zip (16.2 KB)

I’ve changed the attachment in the previous post (avg2.zip). Would someone please run it for me on a PC with a 30-SM NVIDIA card? :-)

Thanks,

/Chris