How are blocks distributed over the SMs? Strange scaling over the number of blocks in a kernel call

PedroUK · June 24, 2012, 11:40am

Hi,

I’m working on a somewhat compute-intense kernel which I execute in N blocks of 32 threads each. These blocks process a large workload which can be equitably distributed over a large range of N.

I’m testing this kernel on a GTX 480, which has 15 SMs with 32 cores each, and looking at the performance for different values of N=1…120, where 120 is the maximum number of blocks that can be resident at once on 15 SMs.

I was expecting the following behaviour:

* For N=1…15, parallel efficiency should be rather good, since each block can run on its own SM.

* Beyond N=15, we have more blocks than SMs, so the performance per block will start to decay, since some blocks will have to share the resources of a common SM.

* Beyond N=120, the code shouldn’t scale any further, since at most 8 blocks can be executed concurrently per SM.

This is more or less the behaviour I observe, yet with the following caveat: The parallel efficiency starts decaying as of N=12.

At N=12, I get a parallel efficiency of >99.5%. At N=13 I get 99.26%, at N=14 it’s 99.07%. This rate of decay continues until I get, for N=120, 77.13%. Beyond N=120, the performance does not scale at all.

So here’s the question: Why does this drop-off occur at N=12 and not, as I would expect, at N=15? Is there something odd about the scheduling of blocks to SMs that will make the GPU not fill all SMs at N=15?

I can exclude that there’s any other computation using the GPU. I have a separate adapter for the attached monitor, I use the machine remotely, and, if there were any other tasks, I would not be able to schedule 120 blocks, i.e. scaling would stop before that.

Any help in understanding this is much appreciated!

Cheers,

Pedro
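The 15 SMs × 8 blocks = 120 figure in the post above can be checked on a given device. The following is a minimal sketch, not Pedro's code: `work_kernel` and `max_resident_blocks` are hypothetical names, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is a runtime call from CUDA toolkits newer than this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the real compute kernel.
__global__ void work_kernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

// Upper bound on the number of blocks resident on the whole device at once.
int max_resident_blocks(int num_sms, int blocks_per_sm) {
    return num_sms * blocks_per_sm;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    // Ask the runtime how many 32-thread blocks of this kernel fit on one SM
    // (no dynamic shared memory requested).
    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, work_kernel, 32, 0) != cudaSuccess) {
        std::printf("occupancy query failed\n");
        return 1;
    }

    std::printf("SMs: %d, resident blocks per SM: %d, device total: %d\n",
                prop.multiProcessorCount, blocks_per_sm,
                max_resident_blocks(prop.multiProcessorCount, blocks_per_sm));
    return 0;
}
```

The per-SM number depends on the kernel's register and shared-memory use as well as on the hardware limit, so it should be queried for the real kernel rather than the placeholder.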

pasoleatis · June 24, 2012, 11:50am

You need to check the occupancy and also run the profiler. This will tell you how the resources are used and whether the code is memory bound or instruction bound, and it will give you some hints about how you can improve it. I can tell you that using only 32 threads per block is very inefficient because there are not enough threads to hide the latencies.

PedroUK · June 24, 2012, 12:17pm

Hi pasoleatis,

I have already profiled the code extensively and know for sure that memory throughput is not an issue. If memory bandwidth were an issue, I wouldn’t see such a sharp change in scaling and the total scaling would be much worse.

I use 32 threads per block for very specific reasons that are not relevant to the problem. High occupancy is not necessarily the best way to go, as pointed out by Vasily Volkov in this webinar.

In any case, neither point explains why performance drops off sharply at N=12 instead of at N=15, as would be expected.

Cheers,

Pedro

seibert · June 24, 2012, 4:26pm

How long does the kernel take to run? It’s hard to guess what other effects might be at play here without knowing how much time 0.24% corresponds to.

PedroUK · June 24, 2012, 4:55pm

Hi seibert,

On a single block, the kernel runs for 667 ms. It’s not the lost performance that bothers me, since in any case I’m only interested in the performance at 120 blocks, not at 15, but that the scheduler seems to be doing something weird.

The timings are taken over 2000 runs of the same kernel and the standard deviation of the timings is something like 0.2 ms, so the drop in performance is real, not just a measurement error. There is also no change in behaviour at N=15, which is where I would have expected one.

I’ll try to post a plot of the parallel efficiency, so you can see what I’m talking about.

Cheers,

Pedro
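For readers who want to reproduce this kind of measurement, here is one way to time repeated launches and compute the mean and standard deviation. It is a sketch under assumptions, not necessarily how the timings in this thread were taken: `work_kernel`, `Stats` and `summarize` are hypothetical names, and CUDA events are used for the timing.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the real compute kernel.
__global__ void work_kernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

struct Stats { double mean; double stddev; };

// Mean and sample standard deviation of a list of timings in milliseconds.
Stats summarize(const std::vector<float> &ms) {
    if (ms.empty()) return {0.0, 0.0};
    double sum = 0.0;
    for (float t : ms) sum += t;
    double mean = sum / ms.size();
    double sq = 0.0;
    for (float t : ms) sq += (t - mean) * (t - mean);
    double stddev = ms.size() > 1 ? std::sqrt(sq / (ms.size() - 1)) : 0.0;
    return {mean, stddev};
}

int main() {
    const int n_blocks = 15, n_threads = 32, n_runs = 2000;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n_blocks * n_threads * sizeof(float)) != cudaSuccess) {
        std::printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n_blocks * n_threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    std::vector<float> ms(n_runs);
    for (int i = 0; i < n_runs; ++i) {
        cudaEventRecord(start);
        work_kernel<<<n_blocks, n_threads>>>(d_data);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);               // wait until the kernel has finished
        cudaEventElapsedTime(&ms[i], start, stop); // elapsed time in milliseconds
    }

    Stats s = summarize(ms);
    std::printf("N=%d: mean %.4f ms, stddev %.4f ms over %d runs\n",
                n_blocks, s.mean, s.stddev, n_runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Repeating this for each N gives the T values from which speedup and efficiency are computed.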

PedroUK · June 24, 2012, 5:03pm

Hi again,

Just to make the question clearer, here’s a plot of the parallel speedup and efficiency over the number of blocks:

[Plot: parallel speedup and efficiency versus number of blocks N; image not available]

Notice the almost perfect scaling until N=12 (yes, it really is N=12, I just don’t have a zoomed plot), followed by the even decay in performance until N=120.

Cheers,
Pedro

pasoleatis · June 24, 2012, 5:13pm

Seibert has a very good point: 0.24% is a very small change in efficiency. Is it possible that the loss is related to the hardware that schedules the blocks? I think there is some penalty for creating more blocks and threads.

PedroUK · June 24, 2012, 5:21pm

Hi pasoleatis,

I don’t think so, otherwise this loss would be uniform over all N. My problem is the notable kink at N=12, when it would be expected at N=15.

Cheers,

Pedro

shawkie · June 24, 2012, 7:05pm

I don’t remember the details, but I think there are some registers you can access that tell you which SM your thread is running on. I think there might be a clock or performance counter you can access too. You could use these to profile how many blocks run on each SM and how long each one takes.

tera · June 25, 2012, 1:15am

I second shawkie’s suggestion: keep a record of the `%smid` (accessible from PTX inline assembly) for each block executed. IIRC compute capability 2.x devices sometimes start scheduling more than one block/SM before all SMs are loaded, because keeping a perfect balance would have required more silicon. There has been a thread on the forums about this before, where Nvidia employees stated this.

EDIT: Just to make sure I understand you correctly, how do you define speedup and parallel efficiency?
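The suggestion from shawkie and tera could be sketched roughly as follows: each block stores the `%smid` value and an elapsed cycle count, and the host builds a blocks-per-SM histogram. The kernel `record_smid`, the helper `blocks_per_sm` and the spin-wait standing in for the real workload are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid: the ID of the SM running this thread.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical stand-in kernel: spins for spin_cycles so that blocks overlap
// in time, then thread 0 of each block records its SM and elapsed cycles.
__global__ void record_smid(unsigned int *smid_of_block,
                            long long *cycles_of_block,
                            long long spin_cycles) {
    long long start = clock64();
    while (clock64() - start < spin_cycles) { /* the real workload goes here */ }
    if (threadIdx.x == 0) {
        smid_of_block[blockIdx.x] = get_smid();
        cycles_of_block[blockIdx.x] = clock64() - start;
    }
}

// Host helper: histogram of blocks per SM ID. Sized from the largest ID seen,
// since SM IDs are not guaranteed to be contiguous.
std::vector<int> blocks_per_sm(const std::vector<unsigned int> &smid) {
    unsigned int max_id = 0;
    for (unsigned int id : smid) if (id > max_id) max_id = id;
    std::vector<int> count(smid.empty() ? 0 : max_id + 1, 0);
    for (unsigned int id : smid) ++count[id];
    return count;
}

int main() {
    const int n_blocks = 15, n_threads = 32;
    unsigned int *d_smid = nullptr;
    long long *d_cycles = nullptr;
    if (cudaMalloc(&d_smid, n_blocks * sizeof(unsigned int)) != cudaSuccess ||
        cudaMalloc(&d_cycles, n_blocks * sizeof(long long)) != cudaSuccess) {
        std::printf("CUDA allocation failed\n");
        return 1;
    }

    record_smid<<<n_blocks, n_threads>>>(d_smid, d_cycles, 1000000LL);

    std::vector<unsigned int> smid(n_blocks);
    std::vector<long long> cycles(n_blocks);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (cudaMemcpy(smid.data(), d_smid, n_blocks * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost) != cudaSuccess ||
        cudaMemcpy(cycles.data(), d_cycles, n_blocks * sizeof(long long),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        std::printf("kernel launch or copy failed\n");
        return 1;
    }

    for (int b = 0; b < n_blocks; ++b)
        std::printf("block %2d ran on SM %2u for %lld cycles\n", b, smid[b], cycles[b]);
    std::vector<int> hist = blocks_per_sm(smid);
    for (size_t s = 0; s < hist.size(); ++s)
        std::printf("SM %2zu: %d block(s)\n", s, hist[s]);

    cudaFree(d_smid);
    cudaFree(d_cycles);
    return 0;
}
```

Any SM with a count of 2 or more while another shows 0 would indicate the uneven placement tera describes.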

PedroUK · June 25, 2012, 8:55am

Hi shawkie & tera

Thanks for the suggestion, I’m trying to get this to work via `__prof_trigger(…)`, but the profiler is giving me a hard time. Will let you know of the results as soon as I have them.

If T_k is the time required using k blocks, the parallel scaling is T_1/T_k and the parallel efficiency is T_1/(k·T_k). The scaling is just the speedup factor, whereas the efficiency is the ratio of observed speedup to perfect speedup.
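Written out, the two definitions above are as follows. The symbols S_k and E_k are labels introduced here for illustration, and the worked line only combines figures already quoted in the thread (667 ms on one block, 77.13% efficiency at N=120), assuming both refer to the same fixed workload.

```latex
\[
  S_k = \frac{T_1}{T_k}, \qquad
  E_k = \frac{S_k}{k} = \frac{T_1}{k\,T_k}
\]
% Worked example from the quoted figures: T_1 = 667 ms, E_120 = 0.7713
\[
  S_{120} = 120 \times 0.7713 \approx 92.6, \qquad
  T_{120} = \frac{T_1}{S_{120}} \approx \frac{667\ \text{ms}}{92.6} \approx 7.2\ \text{ms}
\]
```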

Jimmy_Pettersson · June 25, 2012, 9:35am

Question, T_0 = “Time using zero blocks”? which would logically be zero seconds. Do you mean T_1 in this case?

It would be interesting to plot the actual performance utilization GFLOPS_achieved / GFLOPS_theoretical or bandwidth_achieved / bandwidth_theoretical with the number of blocks. You will likely reach a peak for a very high number of blocks and not 15 or 120.

PedroUK · June 25, 2012, 9:45am

Hi Jimmy,

Oops, sorry, fixed that.

Actually, 120 blocks is the maximum I can schedule on 15 SMs.

I’m also not really interested in GFLOPS but in the total execution time. The problem requires quite a bit of logic and not necessarily much math. I could easily dispense with the logic and add more math and get a better GFLOP rate, but the code as a whole would take longer to execute.

But all this has nothing to do with the question of how blocks are scheduled on the SMs.

Cheers,

Pedro

PedroUK · June 25, 2012, 10:03am

Hi again,

I’ve managed to get the `__prof_trigger(…)` commands to work and to collect the data (using the command-line profiler), and it would seem that, indeed, for N=15 blocks some SMs are assigned more than one block while other SMs sit idle.

This, however, doesn’t answer the question of why this happens or if it can be avoided somehow.

Cheers,
Pedro

Gregory_Diamos · June 25, 2012, 11:35pm

You may want to try upgrading your driver. I remember noticing this on a GTX 470 (Figure 6 in this paper http://gdiamos.net/papers/ocelot-instrumentation.pdf ), and that it went away with a driver update.

PedroUK · June 25, 2012, 11:46pm

Hi Gregory,

Thanks for the link to the paper, it looks quite interesting!

I did the computations both with the 295.41 driver, and with the latest 302.06.03 driver, both to the same effect.

Cheers,

Pedro
