How are blocks distributed over the SMs? Strange scaling over the number of blocks in a kernel call

Hi,

I’m working on a somewhat compute-intense kernel which I execute in N blocks of 32 threads each. These blocks process a large workload which can be equitably distributed over a large range of N.

I’m testing this kernel on a GTX 480, which has 15 SMs with 32 cores each and looking at the performance for different values of N=1…120, i.e. the maximum number of blocks given 15 SMs.

I was expecting the following behaviour:

    For N=1…15, parallel efficiency should be rather good, since each block can run on its own SM.

    As of N=15, we have more blocks than SMs and therefore the performance per block will start to decay since certain blocks will have to share the resources of a common SM.

    As of N=120, the code shouldn’t scale at all since at most 8 blocks can be executed concurrently per SM.

This is more or less the behaviour I observe, yet with the following caveat: The parallel efficiency starts decaying as of N=12.

At N=12, I get a parallel efficiency of >99.5%. At N=13 I get 99.26%, at N=14 it’s 99.07%. This rate of decay continues until I get, for N=120, 77.13%. As of N=120, the performance does not scale at all.

So here’s the question: Why does this drop-off occur at N=12 and not, as I would expect, at N=15? Is there something odd about the scheduling of blocks to SMs that will make the GPU not fill all SMs at N=15?

I can exclude that there’s any other computation using the GPU. I have a separate adapter for the attached monitor, I use the machine remotely, and, if there were any other tasks, I would not be able to schedule 120 blocks, e.g. scaling would stop before that.

Any help in understanding this is much appreciated!

Cheers,

Pedro

You need to check occupancy and also run the profiler, this will tell how are the resources used, if the code is memory bound or instruction bound and also give you some hints about how you can improve it. I can tell that using only 32 threads per block is very inneficient because there are not enough threads to hide the latencies.

Hi pasoleatis,

I have already profiled the code extensively and know for sure that memory throughput is not an issue. If memory bandwidth was an issue, I wouldn’t see such a sharp change in scaling and the total scaling would be much worse.

I use 32 threads per block for very specific reasons that are not relevant to the problem. High occupancy is not necessarily the best way to go, as pointed out by Vasily Volkov in this webinar.

In any case, both points are not an explanation as to why performance drops-off sharply at N=12 instead of at N=15, as would be expected.

Cheers,

Pedro

How long does the kernel take to run? It’s hard to guess what other effects might be at play here without known how much time 0.24% corresponds to.

Hi seibert,

On a single block, the kernel runs for 667 ms. It’s not the lost performance that bothers me, since in any case I’m only interested in the performance at 120 blocks, not at 15, but that the scheduler seems to be doing something weird.

The timings are taken over 2000 runs of the same kernel and the standard deviation of the timings is something like 0.2 ms, so the drop in performance is real, not just a measurement error. There is also no change in the performance at N=15, which is what I would have expected.

I’ll try to post a plot of the parallel efficiency, so you can see what I’m talking about.

Cheers,

Pedro

Hi again,

Just to make the question clearer, here’s a plot of the parallel speedup and efficiency over the number of blocks:

Notice the almost perfect scaling until N=12 (yes, it really is N=12, I just don’t have a zoomed plot), followed by the even decay in performance until N=120.

Cheers,
Pedro

Seibert has a very good point, 0.24 % is very small change in efficiency. Is it possible that the loss is related to the hardware part which schedules the blocks? I think there is some penalty for creating more blocks and threads.

Hi pasoleatis,

I don’t think so, otherwise this loss would be uniform over all N. My problem is the notable kink at N=12, when it would be expected at N=15.

Cheers,

Pedro

I don’t remember the details but I think there some registers that you can access that actually tell you what SM your thread is running in. I think there might be a clock or performance counter you can access too. I think you could use this to profile how many blocks run on each SM and how long each one takes.

I second shawkie’s suggestion: keep a record of the [font=“Courier New”]%smid[/font] (accessible from PTX inline assembly) for each block executed. IIRC compute capability 2.x devices sometimes start scheduling more than one block/SM before all SMs are loaded, because keeping a perfect balance would have required more silicon. There has been a thread on the forums about this before, where Nvidia employees stated this.

EDIT: Just to make sure I understand you correctly, how do you define speedup and parallel efficiency?

Hi shawkie & tera

Thanks for the suggestion, I’m trying to get this to work via [font=“Courier New”]__prof_trigger(…)[/font], but the profiler is giving me a hard time. Will let you know of the results as soon as I have them.

If T[sub]k[/sub] is the time required using k blocks, the parallel scaling is T[sub]1[/sub]/T[sub]k[/sub] and the parallel efficiency is T[sub]1[/sub]/T[sub]k[/sub]/k. The scaling is just the speedup factor whereas the efficiency is the ratio of observed speedup vs. perfect speedup.

Question, T_0 = “Time using zero blocks”? which would logically be zero seconds. Do you mean T_1 in this case?

It would be interesting to plot the actual performance utilization GFLOPS_achieved / GFLOPS_theoretical or bandwidth_achieved / bandwidth_theoretical with the number of blocks. You will likely reach a peak for a very high number of blocks and not 15 or 120.

Hi Jimmy,

Oops, sorry, fixed that.

Actually, 120 blocks is the maximum I can schedule on 15 SMs.

I’m also not really interested in GFLOPS but in the total execution time. The problem requires quite a bit of logic and not necessarily much math. I could easily dispense with the logic and add more math and get a better GFLOP rate, but the code as a whole would take longer to execute.

But all this has nothing to do with the question of how blocks are scheduled on the SMs.

Cheers,

Pedro

Hi again,

I’ve managed to get the [font=“Courier New”]__prof_trigger(…)[/font] commands to work and to collect the data (using the command-line profiler) and it would seem that, effectively, for N=15 blocks, some SMs are assigned more than one block and some SMs idle.

This, however, doesn’t answer the question of why this happens or if it can be avoided somehow.

Cheers,
Pedro

You may want to try upgrading your driver. I remember noticing this on a GTX 470 (Figure 6 in this paper http://gdiamos.net/papers/ocelot-instrumentation.pdf ), and that it went away with a driver update.

Hi Gregory,

Thanks for the link to the paper, it looks quite interesting!

I did the computations both with the 295.41 driver, and with the latest 302.06.03 driver, both to the same effect.

Cheers,

Pedro