About the scheduler inside a GPU SMX

Dear All

Assume compute capability 3.5 (K40).

Consider the following situation:

  • The SMX register file permits 5 thread blocks
  • But due to the 2048-thread limit, only 4 blocks run at a time

Does the scheduler prepare a 5th block in the available register memory so that it is ready to run when one of the 4 blocks ends (in advance, even before any block finishes)?

Thanks

Luis Gonçalves

What exactly does “prepare a block in the available memory” mean?

I find it difficult to imagine any such work short of actually starting the threads, which we have assumed are not available yet.

kernel1
grid size 1000
block size 512

The number of registers per thread is such that 5 blocks fit in the available register memory of the SMX.

Each SMX runs 4 blocks at a time due to the 2048-thread limit per SMX.

When one of those blocks finishes, is a 5th block already prepared and ready to run, or does it still have to be scheduled?
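
For concreteness, a minimal sketch of the launch configuration being discussed; the kernel body and its arguments are hypothetical stand-ins, not taken from the original post:

// Hypothetical stand-in for kernel1; the real kernel body is whatever the
// application actually computes.
__global__ void kernel1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * in[i];   // placeholder for the real work
    }
}

// Host-side launch: a grid of 1000 blocks of 512 threads each.
// kernel1<<<1000, 512>>>(d_in, d_out, 1000 * 512);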

This reads very much like a question about an XY-problem. What are you trying to accomplish?

Increase performance

Increase performance how, exactly? I am unable to follow the reasoning here. It seems you are seeking assurances that cannot be given, since the details of the GPU scheduling mechanisms are not documented and are subject to change.

If this is just for a specific part (which seems to be the case), why not just give it a try, and measure the performance? When it comes to GPUs, I usually recommend experimenting instead of theorizing, since so little information is available about the details of their microarchitecture.

So to state your problem in a slightly different way, inspired by njuffa’s post:

  • You have reached the maximum number of threads per SM (i.e. 100% occupancy)
  • However at 100% occupancy there are still registers left
  • You want to put those registers to good use in order to increase performance.

Is that a fair summary?

If that is the case, I would suggest looking for opportunities to increase instruction-level parallelism, including aggressive use of __restrict__ to allow early loads. Depending on what the software does, also look for opportunities for in-register table lookups and try to increase the number of common subexpressions. All of these have a tendency to drive up register usage and benefit performance.
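
As a hedged illustration (the kernel and its arguments are hypothetical), __restrict__ promises the compiler that the pointers do not alias, so it is free to schedule both loads early and overlap their latency:

// Hypothetical example: with __restrict__, both loads can be issued back to
// back, each holding a register while in flight, which increases ILP.
__global__ void axpy2(const float * __restrict__ x,
                      const float * __restrict__ y,
                      float * __restrict__ z,
                      float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];      // early load
        float yi = y[i];      // second load issued without waiting for the first
        z[i] = a * xi + yi;   // likely contracted into a single FMA
    }
}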

Especially in floating-point computation, the compiler won’t re-associate expressions, with the exception of FMA contraction, so as not to change the semantics of the computation; any re-association that benefits performance has to be performed by the programmer.
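
A small hypothetical illustration of re-association done by the programmer (a, b, c, d are assumed floats; acceptable only if the slightly different rounding is tolerable):

// Serial chain: each addition depends on the previous result.
float s_serial = ((a + b) + c) + d;

// Re-associated by the programmer: the two inner additions are independent
// and can execute in parallel, at the cost of slightly different rounding.
float s_paired = (a + b) + (c + d);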

There are a couple of ways in which code using more registers may perform faster. The compiler may even be able to find some of them itself if you just use a __launch_bounds__ qualifier to inform it of the register budget that is available.
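
For example, a sketch using the numbers from this thread (512 threads per block, at least 4 resident blocks per SMX); kernel1 and its parameters are hypothetical:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) lets the
// compiler budget registers for 512-thread blocks with at least 4 blocks
// resident per SMX.
__global__ void __launch_bounds__(512, 4)
kernel1(const float *in, float *out, int n)
{
    // ... kernel body ...
}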

Yes, that is it in other words, tera.

I am trying to explain why a block size of 512 performs better than one of 1024.

With 512, the registers are enough for 5 blocks (only 4 run at a time).
With 1024, the registers are enough for only 2 blocks (not enough for a 3rd), so only 2 run at a time.

Performance is better with 512.

I do not want to justify this without a basis.
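
A hedged sketch of how the runtime's occupancy API could confirm these block counts for the actual kernel; kernel1 again stands in for the real kernel, and the result depends on whatever register and shared-memory usage the compiler actually produces:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(const float *in, float *out, int n)
{
    // ... real kernel body ...
}

int main()
{
    // Ask the runtime how many blocks of each size can be resident per SMX,
    // given kernel1's actual register and shared-memory usage.
    for (int blockSize = 512; blockSize <= 1024; blockSize += 512) {
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel1,
                                                      blockSize, 0);
        printf("block size %4d: %d resident blocks = %d resident threads per SMX\n",
               blockSize, numBlocks, numBlocks * blockSize);
    }
    return 0;
}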

Experience indicates that in general, finer granularity (fewer threads per block) is often advantageous for GPU performance, all other parameters being equal. A good initial target for block size is 128 to 256 threads; possibly smaller with the latest GPU architectures. While there are a number of possible architectural explanations (e.g. unbalanced execution in large thread blocks, ramp-up and ramp-down effects at the start and end of kernels) for this, differences can also be due to hardware implementation artifacts (“butterfly effects”), especially for memory-intensive codes.

The interactions of multiple levels of scheduling, the specific sequences of loads and stores in each warp, and buffering and re-ordering in the memory controllers are very complicated, and cannot be modeled satisfactorily with publicly available information. In my work I have found that often there is no readily discernable cause-effect pattern in multi-dimensional shmoo plots of performance data, which in turn indicates that brute force auto-tuning based on multiple configuration parameters would be useful.
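
A minimal sketch of such brute-force auto-tuning over the block size alone, timing the hypothetical kernel1 with CUDA events (n, d_in, and d_out are assumed to be set up elsewhere):

// Try a few candidate block sizes and keep the empirically fastest one.
float bestMs = 1e30f;
int bestBlock = 0;
for (int block : {128, 256, 512, 1024}) {
    int grid = (n + block - 1) / block;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel1<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < bestMs) { bestMs = ms; bestBlock = block; }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
// bestBlock now holds the fastest configuration measured on this GPU.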

A look at the profiler statistics should highlight which metrics in particular are affected by the change in block configuration for your code, and I would suggest documenting those as the immediate causes of the observed performance differences.

My experience is different. In one kernel, a block size of 32 performs worse than 32x16.

And in many kernels I increased the block size to increase performance.

But even with a 32x16 block size, my grid size is big enough to fill all the SMXs with work.

A block size of 32 is good when I have many SMXs but the grid size would otherwise be small (with a 32x16 block size). That way (with 32), the work is distributed equally across the SMXs and fills them.

In other words, a block size of 32 is good when there is little parallelism.

For a K40, a block size of 32 threads is too small, because it limits occupancy severely, and this is specifically disqualified under “all other parameters being equal”.

I stated that good results are usually achieved with blocks of 128 to 256 threads, possibly fewer on Maxwell and Pascal. Another way of saying this is: it is often advantageous to make the block size as small as good occupancy will allow.

Yes, with a maximum of 16 blocks per SMX and a block size of 32, only 16 × 32 = 512 threads can be resident per SMX at a time (a quarter of the 2048-thread limit).