Does the new independent thread scheduling give better performance?

ancah · February 6, 2020, 10:40am

Thanks so much for the quick reply.

In the link you sent me, it says: “Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency …”

I still can’t find an example or understand where this independent thread scheduling would help in “making better use of execution resources” or “maximize parallel efficiency.”

Unless… the “SIMT units” mentioned in “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units” are actually the processing blocks of an SM on a CC 7.x GPU. Since a processing block has 16 cores, the 32 threads of a warp could be split into two groups of 16 threads, each possibly executing a different instruction. Otherwise, if the SIMT unit were 32 threads wide, as on previous GPU architectures, I don’t see any difference between having independent thread scheduling or not. Here’s an example:

if (threadIdx.x % 2)
   A; // odd threads; say A takes x clock cycles
else
   B; // even threads; say B takes x clock cycles

CC < 7.0

SIMT unit ← 32 threads
Execution:
A ← x clock cycles, 32 threads, even threads are masked
B ← x clock cycles, 32 threads, odd threads are masked

TOTAL: 2*x cycles

CC 7.x, no independent thread scheduling

SIMT unit ← 16 threads
Execution:
A ← x clock cycles, first 16 threads, even threads are masked
A ← x clock cycles, second 16 threads, even threads are masked
B ← x clock cycles, first 16 threads, odd threads are masked
B ← x clock cycles, second 16 threads, odd threads are masked

TOTAL: 4*x cycles

CC 7.x, with independent thread scheduling

SIMT unit ← 16 threads
Execution:
A ← x clock cycles, 16 odd threads
B ← x clock cycles, 16 even threads

TOTAL: 2*x cycles

Is this correct, or am I missing something?
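
One way to put numbers on the three scenarios above would be to time the branch directly. What follows is a minimal CUDA sketch, not a definitive benchmark. The names workA, workB, branchKernel and the mode parameter are hypothetical, the loop bodies are arbitrary stand-ins for A and B, and it assumes CUDA 9 or later for __syncwarp(). It launches a single warp and reports the cycles seen by lane 0 for all-A, all-B, and the odd/even split.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for statements A and B. __noinline__ is used to make
// it less likely that the compiler merges or predicates the two paths.
__device__ __noinline__ float workA(float v) {
    for (int i = 0; i < 4096; ++i) v = fmaf(v, 1.0001f, 1.0f);
    return v;
}
__device__ __noinline__ float workB(float v) {
    for (int i = 0; i < 4096; ++i) v = (v + 3.0f) * 0.5f;
    return v;
}

// mode 0: every lane runs A; mode 1: every lane runs B;
// mode 2: odd lanes run A, even lanes run B (the divergent case in question).
__global__ void branchKernel(float *out, long long *cycles, int mode) {
    float v = (float)threadIdx.x;
    bool takeA = (mode == 0) || (mode == 2 && (threadIdx.x % 2));
    long long start = clock64();
    if (takeA)
        v = workA(v);
    else
        v = workB(v);
    __syncwarp();                      // wait until all 32 lanes have finished their path
    long long stop = clock64();
    out[threadIdx.x] = v;              // store the result so the work is not optimized away
    if (threadIdx.x == 0) *cycles = stop - start;
}

// Launch one warp in the given mode and return lane 0's elapsed cycle count.
static long long run(int mode, float *dOut, long long *dCycles) {
    long long h = 0;
    branchKernel<<<1, 32>>>(dOut, dCycles, mode);
    cudaMemcpy(&h, dCycles, sizeof(h), cudaMemcpyDeviceToHost);  // also waits for the kernel
    return h;
}

int main() {
    float *dOut = nullptr;
    long long *dCycles = nullptr;
    cudaMalloc(&dOut, 32 * sizeof(float));
    cudaMalloc(&dCycles, sizeof(long long));

    run(0, dOut, dCycles);             // warm-up launch, result ignored
    printf("all A      : %lld cycles\n", run(0, dOut, dCycles));
    printf("all B      : %lld cycles\n", run(1, dOut, dCycles));
    printf("odd A/even B: %lld cycles\n", run(2, dOut, dCycles));

    cudaFree(dOut);
    cudaFree(dCycles);
    return 0;
}
```

If the divergent count comes out close to the sum of the two uniform counts, the two paths were serialized; if it is close to the larger of the two, they overlapped. Building the same file for a pre-Volta target and for sm_70 (for example with nvcc -arch=sm_70) would allow a comparison across architectures. The compiler may still restructure the branch, so checking the generated SASS with cuobjdump -sass before trusting the numbers seems advisable.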

Sorry if these are silly questions, just trying to understand the reasons behind independent thread scheduling and how it actually works.

Thanks!
