Does the new independent thread scheduling give better performance?

Thanks so much for the quick reply.

In the link you sent me, it says: “Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency…”

I still can’t find an example or understand where this independent thread scheduling would help in “making better use of execution resources” or “maximize parallel efficiency.”

Unless… the “SIMT units” mentioned in “Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units” are actually the processing blocks of an SM on a CC 7.x GPU. Since a processing block has 16 cores, the 32 threads of a warp could be split into two groups of 16 threads, each possibly executing a different instruction. Otherwise, if the SIMT unit were of size 32, as on previous GPU architectures, I don’t see any difference between having independent thread scheduling or not. Here’s an example:

if (threadIdx.x % 2)
   A; // odd threads; say A takes x clock cycles
else
   B; // even threads; say B takes x clock cycles

CC < 7.0

  • SIMT unit ← 32 threads
  • Execution:
    A ← x clock cycles, 32 threads, even threads are masked
    B ← x clock cycles, 32 threads, odd threads are masked

TOTAL: 2*x cycles

CC 7.x, no independent thread scheduling

  • SIMT unit ← 16 threads
  • Execution:
    A ← x clock cycles, first 16 threads, even threads are masked
    A ← x clock cycles, second 16 threads, even threads are masked
    B ← x clock cycles, first 16 threads, odd threads are masked
    B ← x clock cycles, second 16 threads, odd threads are masked

TOTAL: 4*x cycles

CC 7.x, with independent thread scheduling

  • SIMT unit ← 16 threads
  • Execution:
    A ← x clock cycles, 16 odd threads
    B ← x clock cycles, 16 even threads

TOTAL: 2*x cycles
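
To make the counting above explicit, here is a small host-side C++ sketch of the model assumed in this example. It only encodes the arithmetic of the three cases (a SIMT unit issues one instruction per pass over a group of threads, and a pass over a branch costs x cycles); it is not a description of what the hardware actually does. The names `Scenario`, `passesForBranch` and `totalCycles` are made up for illustration.

```cpp
// Hypothetical cycle-count model for the example in this post.
// It mirrors the counting in the three lists above, nothing more.

const int kWarpSize = 32;

struct Scenario {
    int simtWidth;  // threads issued together: 32 or 16 in the post
    bool regroup;   // true: active threads of a branch are packed into SIMT units
};

// Number of passes needed to run one branch that `active` threads of the warp take.
inline int passesForBranch(const Scenario& s, int active) {
    // Without regrouping, every SIMT-unit-sized slice of the warp is issued
    // and the inactive lanes are masked. With regrouping, only the active
    // threads are counted.
    const int threads = s.regroup ? active : kWarpSize;
    return (threads + s.simtWidth - 1) / s.simtWidth;  // ceiling division
}

// Total cycles for `if (threadIdx.x % 2) A; else B;`,
// where A and B each cost x cycles per pass.
inline long totalCycles(const Scenario& s, long x) {
    const int oddThreads = kWarpSize / 2;   // threads taking branch A
    const int evenThreads = kWarpSize / 2;  // threads taking branch B
    return (passesForBranch(s, oddThreads) + passesForBranch(s, evenThreads)) * x;
}
```

Calling `totalCycles` with `Scenario{32, false}`, `Scenario{16, false}` and `Scenario{16, true}` and x = 10 gives 20, 40 and 20 cycles, i.e. 2*x, 4*x and 2*x as in the three cases above.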

Is this correct, or am I missing something?

Sorry if these are silly questions, just trying to understand the reasons behind independent thread scheduling and how it actually works.

Thanks!