Does the new independent thread scheduling give better performance?

I am looking for a simple example where the new independent thread scheduling would show its performance advantages in comparison with the previous per-warp, lock-step execution model.

Looking at a simple if/else example, just as presented here: https://devblogs.nvidia.com/inside-volta/ (see fig. 11 and fig. 12), I don’t see any speedup for the new approach.

Moreover, the same page says:
(1) “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units”
AND
(2) “Note that execution is still SIMT: at any given clock cycle CUDA cores execute the same instruction for all active threads in a warp just as before”

So, even if each thread has its own call stack and program counter, the scheduler does not create a heterogeneous mix of threads: threads running in parallel will always belong to the same warp and will always execute the same instruction. So, apart from changing the execution order of A, B, X, Y, what else changes? When is it faster?

Any help would be very much appreciated.
Thanks!

I wouldn’t say the Independent Thread Scheduling is faster. It enables finer-grain synchronization and cooperation between parallel threads in a program. Per the blog post, statements from the if and else branches in the program can now be interleaved in time, allowing for more complex algorithms.
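To make the "interleaved in time" point concrete, here is a minimal sketch (kernel and variable names are my own, not from the blog post) of intra-warp cooperation that relies on that interleaving: even lanes spin-wait on data that odd lanes of the same warp produce. Under pre-Volta lock-step execution the warp would commit one branch to completion first, so the consumer loop could starve the producer branch and hang; with independent thread scheduling the two branches can make progress concurrently. Assumes compute capability >= 7.0.

```cuda
// Sketch only: intra-warp producer/consumer across divergent branches.
// Requires independent thread scheduling (cc >= 7.0) to be safe.
__global__ void intraWarpProducerConsumer(int *out)
{
    __shared__ volatile int data[32];
    __shared__ volatile int ready[32];

    int lane = threadIdx.x % 32;
    if (lane % 2 == 0)
        ready[lane] = 0;          // consumers initialize their flags
    __syncwarp();                 // make the initialization visible warp-wide

    if (lane % 2 == 1) {
        // Producer branch: odd lanes publish a value for the even lane below.
        data[lane - 1] = lane * lane;
        __threadfence_block();    // order the data write before the flag write
        ready[lane - 1] = 1;
    } else {
        // Consumer branch: even lanes spin until their producer has written.
        // Pre-Volta, this loop could prevent the producer branch from ever
        // executing; with ITS the scheduler can interleave the two branches.
        while (ready[lane] == 0) { /* spin */ }
        out[lane] = data[lane];
    }
}
```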

You can find more information in the Volta whitepaper https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.

I don’t recall it being marketed as faster…

Thanks so much for the quick reply.

In the link you sent me, it says: “Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency […]”

I still can’t find an example or understand where this independent thread scheduling would help in “making better use of execution resources” or “maximize parallel efficiency.”

unless… the “SIMT units” mentioned in “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units” are actually the processing blocks of an SM for a C.C. 7.x GPU. Since a processing block has 16 cores, the 32 threads of a warp can be split into two groups of 16 threads, each possibly executing a different instruction. Otherwise, if the SIMT unit were of size 32, as on previous GPU architectures, then I don’t see any difference between having this independent thread scheduling or not. Here’s an example:

if (threadIdx.x % 2)
   A; // say A takes x clock cycles
else
   B; // say B takes x clock cycles

CC < 7.0

  • SIMT unit <- 32 threads
  • Execution:
    A <- x clock cycles, 32 threads, even threads are masked
    B <- x clock cycles, 32 threads, odd threads are masked

TOTAL: 2*x cycles

CC 7.x, no independent thread scheduling

  • SIMT unit <- 16 threads
  • Execution:
    A <- x clock cycles, first 16 threads, even threads are masked
    A <- x clock cycles, second 16 threads, even threads are masked
    B <- x clock cycles, first 16 threads, odd threads are masked
    B <- x clock cycles, second 16 threads, odd threads are masked

TOTAL: 4*x cycles

CC 7.x, with independent thread scheduling

  • SIMT unit <- 16 threads
  • Execution:
    A <- x clock cycles, 16 odd threads
    B <- x clock cycles, 16 even threads

TOTAL: 2*x cycles
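The cycle counts above could be checked empirically. Here is a hypothetical micro-benchmark sketch (the kernel name and the A/B stand-in functions are my own) that times the divergent if/else with clock64() and could be compiled once with -arch=sm_60 and once with -arch=sm_70 for comparison:

```cuda
// Sketch: time the divergent branch from one warp using clock64().
// A() and B() are arbitrary stand-ins for the x-cycle bodies above.
__device__ __forceinline__ int A(int v) { return v * 3 + 1; }
__device__ __forceinline__ int B(int v) { return v * 5 - 1; }

__global__ void timeDivergence(int *out, long long *cycles)
{
    long long start = clock64();

    int v = threadIdx.x;
    if (threadIdx.x % 2)
        v = A(v);   // odd lanes
    else
        v = B(v);   // even lanes

    long long stop = clock64();
    out[threadIdx.x] = v;          // keep the work from being optimized away
    if (threadIdx.x == 0)
        *cycles = stop - start;
}
// Launch with <<<1, 32>>> so exactly one warp is measured.
```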

Is this correct, or am I missing something?

Sorry if these are silly questions, just trying to understand the reasons behind independent thread scheduling and how it actually works.

Thanks!

One of the biggest advantages of independent thread scheduling is that it better enables interleaved execution of statements from divergent branches (Figure 22 of the whitepaper). Volta’s ability to independently schedule threads within a warp makes it possible to implement complex, fine-grained algorithms and data structures in a more natural way. Starvation-free algorithms are a great example. While they could be programmed prior to Volta, they were very cumbersome.

Another important advantage is that implementations that would cause a deadlock on previous hardware can now run as expected, in most cases.
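The classic illustration of such a deadlock is a per-thread spinlock taken by every lane of a warp. The sketch below (names are hypothetical) can hang on pre-Volta GPUs: under lock-step execution, the losing lanes keep the warp spinning in the loop, so the lane that holds the lock never reaches the release. With independent thread scheduling, the owning thread can make forward progress and release the lock:

```cuda
// Sketch: all 32 lanes of a warp contend for one spinlock.
// Pre-Volta this can deadlock; with ITS (cc >= 7.0) it completes.
__global__ void perThreadLock(int *lock, int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {  // try to acquire the lock
            (*counter)++;                  // critical section
            __threadfence();               // flush before releasing
            atomicExch(lock, 0);           // release the lock
            done = true;
        }
        // Losing lanes loop and retry; pre-Volta they could starve
        // the lane that holds the lock, since the whole warp executes
        // in lock-step.
    }
}
```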

It’s more about having the ability to code algorithms more naturally. Before, you might have had to add extra code just to get something to work, causing sub-optimal performance.

Check out https://www.icl.utk.edu/files/publications/2018/icl-utk-1080-2018.pdf and https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8344474.

I’ll have to get back to you on your example.

Based on your description, it seems to me that code can run faster when threads are scheduled independently IF the code has lots of conditional logic that results in frequent divergence. The conditional logic would be the key to that.