Does the new independent thread scheduling give better performance?

I am looking for a simple example where the new independent thread scheduling would show its performance advantages in comparison with the previous per-warp, lock-step execution model.

Looking at a simple if/else example, just as presented here: https://devblogs.nvidia.com/inside-volta/ (see fig. 11 and fig. 12), I don’t see any speedup for the new approach.

Moreover, the same page says:
(1) “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units”
AND
(2) “Note that execution is still SIMT: at any given clock cycle CUDA cores execute the same instruction for all active threads in a warp just as before”

So, even if each thread has its own call stack and program counter, the scheduler does not create a heterogeneous mix of threads: threads running in parallel will always belong to the same warp and will always execute the same instruction. So, apart from changing the execution order of A, B, X, Y, what else changes? When is it faster?

Any help would be very much appreciated.
Thanks!

I wouldn’t say the Independent Thread Scheduling is faster. It enables finer-grain synchronization and cooperation between parallel threads in a program. Per the blog post, statements from the if and else branches in the program can now be interleaved in time, allowing for more complex algorithms.
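To make the "interleaved in time" point concrete, here is a minimal sketch (kernel and variable names are my own, not from the blog post) of intra-warp cooperation that relies on that interleaving: even lanes spin-wait on data that odd lanes of the same warp produce. Under pre-Volta lock-step execution the warp would commit one branch to completion first, so the consumer loop could starve the producer branch and hang; with independent thread scheduling the two branches can make progress concurrently. Assumes compute capability >= 7.0.

```cuda
// Sketch only: intra-warp producer/consumer across divergent branches.
// Requires independent thread scheduling (cc >= 7.0) to be safe.
__global__ void intraWarpProducerConsumer(int *out)
{
    __shared__ volatile int data[32];
    __shared__ volatile int ready[32];

    int lane = threadIdx.x % 32;
    if (lane % 2 == 0)
        ready[lane] = 0;          // consumers initialize their flags
    __syncwarp();                 // make the initialization visible warp-wide

    if (lane % 2 == 1) {
        // Producer branch: odd lanes publish a value for the even lane below.
        data[lane - 1] = lane * lane;
        __threadfence_block();    // order the data write before the flag write
        ready[lane - 1] = 1;
    } else {
        // Consumer branch: even lanes spin until their producer has written.
        // Pre-Volta, this loop could prevent the producer branch from ever
        // executing; with ITS the scheduler can interleave the two branches.
        while (ready[lane] == 0) { /* spin */ }
        out[lane] = data[lane];
    }
}
```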

You can find more information in the Volta whitepaper https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.

I don’t recall it being marketed as faster…

Thanks so much for the quick reply.

In the link you sent me, it says: “Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency […]”

I still can’t find an example or understand where this independent thread scheduling would help in “making better use of execution resources” or “maximize parallel efficiency.”

unless… the “SIMT units” mentioned in “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units” are actually the processing blocks of an SM for a C.C. 7.x GPU. Since a processing block has 16 cores, the 32 threads of a warp can be split into two groups of 16 threads, each possibly executing a different instruction. Otherwise, if the SIMT unit were of size 32, as on previous GPU architectures, then I don’t see any difference between having this independent thread scheduling or not. Here’s an example:

if (threadIdx.x % 2)
   A; // say A takes x clock cycles
else
   B; // say B takes x clock cycles

CC < 7.0

  • SIMT unit <- 32 threads
  • Execution:
    A <- x clock cycles, 32 threads, even threads are masked
    B <- x clock cycles, 32 threads, odd threads are masked

TOTAL: 2*x cycles

CC 7.x, no independent thread scheduling

  • SIMT unit <- 16 threads
  • Execution:
    A <- x clock cycles, first 16 threads, even threads are masked
    A <- x clock cycles, second 16 threads, even threads are masked
    B <- x clock cycles, first 16 threads, odd threads are masked
    B <- x clock cycles, second 16 threads, odd threads are masked

TOTAL: 4*x cycles

CC 7.x, with independent thread scheduling

  • SIMT unit <- 16 threads
  • Execution:
    A <- x clock cycles, 16 odd threads
    B <- x clock cycles, 16 even threads

TOTAL: 2*x cycles
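The cycle counts above could be checked empirically. Here is a hypothetical micro-benchmark sketch (the kernel name and the A/B stand-in functions are my own) that times the divergent if/else with clock64() and could be compiled once with -arch=sm_60 and once with -arch=sm_70 for comparison:

```cuda
// Sketch: time the divergent branch from one warp using clock64().
// A() and B() are arbitrary stand-ins for the x-cycle bodies above.
__device__ __forceinline__ int A(int v) { return v * 3 + 1; }
__device__ __forceinline__ int B(int v) { return v * 5 - 1; }

__global__ void timeDivergence(int *out, long long *cycles)
{
    long long start = clock64();

    int v = threadIdx.x;
    if (threadIdx.x % 2)
        v = A(v);   // odd lanes
    else
        v = B(v);   // even lanes

    long long stop = clock64();
    out[threadIdx.x] = v;          // keep the work from being optimized away
    if (threadIdx.x == 0)
        *cycles = stop - start;
}
// Launch with <<<1, 32>>> so exactly one warp is measured.
```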

Is this correct, or am I missing something?

Sorry if these are silly questions, just trying to understand the reasons behind independent thread scheduling and how it actually works.

Thanks!

One of the biggest advantages of independent thread scheduling is that it better enables interleaved execution of statements from divergent branches (Figure 22 of the whitepaper). Volta’s ability to independently schedule threads within a warp makes it possible to implement complex, fine-grained algorithms and data structures in a more natural way. Starvation-free algorithms are a great example. While they could be programmed prior to Volta, they were very cumbersome.

Another important advantage is that implementations that would cause a deadlock on previous hardware can now run as expected, in most cases.
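The classic illustration of such a deadlock is a per-thread spinlock taken by every lane of a warp. The sketch below (names are hypothetical) can hang on pre-Volta GPUs: under lock-step execution, the losing lanes keep the warp spinning in the loop, so the lane that holds the lock never reaches the release. With independent thread scheduling, the owning thread can make forward progress and release the lock:

```cuda
// Sketch: all 32 lanes of a warp contend for one spinlock.
// Pre-Volta this can deadlock; with ITS (cc >= 7.0) it completes.
__global__ void perThreadLock(int *lock, int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {  // try to acquire the lock
            (*counter)++;                  // critical section
            __threadfence();               // flush before releasing
            atomicExch(lock, 0);           // release the lock
            done = true;
        }
        // Losing lanes loop and retry; pre-Volta they could starve
        // the lane that holds the lock, since the whole warp executes
        // in lock-step.
    }
}
```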

It’s more about having the ability to code algorithms more naturally. Before, you might have had to add extra code just to get something to work, causing sub-optimal performance.

Check out https://www.icl.utk.edu/files/publications/2018/icl-utk-1080-2018.pdf and https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8344474.

I’ll have to get back to you on your example.

Based on your description, it seems to me that code can run faster when threads are scheduled independently IF the code has lots of conditional logic that results in frequent divergence. The conditional logic would be the key to that.