So, one year after implementing a scheme for scheduling CUDA threads onto CPU hardware threads, I have an opportunity to go back and redesign it from scratch. I thought I would see if anyone on here would be interested in commenting on what I have come up with so far:
The main criterion here is speed. Every other concern, such as scheduling and state management, is a means to get more speed.
These are the assumptions that I make. Each complete kernel is broken up into N sub-kernels. Each sub-kernel starts from either a program entry point, a barrier, or an arbitrary path. Each sub-kernel ends in either a program exit point, a function call, or a path of a specific length. Code is inserted into each sub-kernel such that every path that exits the kernel through a point other than the main exit point saves the id of the sub-kernel to resume execution at and, if there are multiple entry points to that sub-kernel, the id of the correct entry point. Each sub-kernel with more than one entry point is augmented with a scheduler block that performs an indirect branch to the specified entry point depending on an input value. Upon exiting a sub-kernel, thread-local state is saved on a stack; upon entering, it is restored.
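To make this concrete, here is a minimal Python sketch of one sub-kernel with two entry points and a scheduler block that dispatches on a saved entry-point id. All names (`sub_kernel`, the `ctx` dict, the `"barrier"`/`"exit"` fates) are illustrative, not part of any existing implementation:

```python
def sub_kernel(ctx):
    """Run one sub-kernel for a single thread context.

    ctx holds thread-local state; ctx["entry"] is the entry-point id
    saved by the previous sub-kernel (0 = original entry point).
    """
    # Scheduler block: indirect branch to the requested entry point.
    entry = ctx.pop("entry", 0)
    if entry == 0:                  # original program entry point
        ctx["x"] = ctx["tid"] * 2
        if ctx["x"] >= 4:
            # Exit through a barrier: record the entry point to
            # resume at, then yield back to the scheduler.
            ctx["entry"] = 1
            return "barrier"
    # Entry point 1: code after the barrier; entry 0 may fall through.
    ctx["x"] += 1
    return "exit"                   # main exit point
```

A thread whose condition trips the barrier returns `"barrier"` once, and on its next invocation resumes at entry point 1 and runs to the main exit.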
Two versions of each kernel are generated: a serial version and a fused version. The serial version evaluates one thread at a time, while the fused version evaluates N threads at a time that execute in lock-step. The fused version is generated by declaring every scalar instruction as an N-wide vector instruction and then performing common sub-expression elimination to issue scalar instructions for values that are shared among all threads, to avoid recomputing the same values. For every divergent branch, a vector comparison instruction is inserted; if the branch condition is not evaluated uniformly among all threads, all threads jump out of the fused kernel and save their state. Fused kernels may contain vector (SSE, AVX) instructions or a series of scalar instructions. The assumption is that scalar instructions will give a performance advantage because instructions from different threads cannot have data dependencies. The limitation is that fusing more threads together will increase register pressure, so the optimal value of N will vary from kernel to kernel. Assume that it is chosen heuristically to try to balance these constraints.
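The fused version can be sketched in a few lines of Python, with lists standing in for N-wide vector registers. The function name, the uniform parameter `k`, and the fusion width are all hypothetical; the point is the shared-scalar hoisting and the uniformity check before a divergent branch:

```python
N = 4  # fusion width, chosen heuristically per kernel

def fused_kernel(tids, k):
    """Evaluate N threads in lock-step. `k` is uniform across threads,
    so after common sub-expression elimination it is computed once as
    a scalar rather than once per lane."""
    shared = k * 2                        # scalar: shared by all threads
    xs = [tid * shared for tid in tids]   # vector: one lane per thread
    conds = [x > 4 for x in xs]           # vector comparison
    if all(conds) != any(conds):          # condition not uniform
        return "diverged", xs             # bail out, save per-thread state
    if conds[0]:                          # uniform branch: stay fused
        xs = [x + 1 for x in xs]
    return "ok", xs
```

When the lanes disagree on the branch condition, the whole group falls back to the serial version with its state intact; when they agree, a single scalar branch suffices for all N threads.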
I assume that the ability to do this already exists.
The problem now is how to manage state among all of the threads and determine their scheduling order. The goal is to spend less than 5% of the time in the scheduler while still getting close to an optimal schedule, where the optimal schedule is defined as the one that produces the shortest total execution time.
Here is what I have so far:
- During sub-kernel formation and kernel fusion, make sure that the number of instructions on all paths from the entry point to any exit point is around 20x the number executed by the scheduler in the worst case. Unroll and split loops if necessary to accomplish this.

- Preallocate enough state for the max number of threads per CTA and share it across CTA invocations. This is not in the scheduler loop.

- Always issue threads with the widest vector width available.
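The preallocation point can be sketched as a simple context pool: allocate every context record up front for the largest possible CTA, then recycle records instead of allocating inside the scheduler loop. `MAX_THREADS_PER_CTA` and the record layout are assumptions for illustration:

```python
MAX_THREADS_PER_CTA = 512   # assumed hardware/CTA limit

class ContextPool:
    """Reusable thread-context records, allocated once and shared
    across CTA invocations; acquire/release never touch the heap."""
    def __init__(self):
        self._free = [{"tid": None, "stack": []}
                      for _ in range(MAX_THREADS_PER_CTA)]

    def acquire(self, tid):
        ctx = self._free.pop()
        ctx["tid"] = tid
        return ctx

    def release(self, ctx):
        ctx["stack"].clear()        # reclaim state for the next thread
        self._free.append(ctx)
```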
1. Start by launching threads in order. If they finish before exiting
the sub-kernel, then they die here and their state is reclaimed. If they
bail out due to divergence or hit a context-switch point, then allocate
a new thread context for the next thread and save the current thread
context's id in a queue for scheduling in the next sub-kernel.

2. Sort the queues by thread count. Pick the queue with the most waiting
threads.
   a) If it has not been jitted yet, do so now.
   b) Group threads together into warps of fused-kernel width
      (possibly by sorting, but FCFS should require less overhead).
      Launch them all.
   c) If a thread exits, kill it and reclaim its state; otherwise move
      it to another queue.
   d) If a thread hits a barrier, put it into a barrier queue.
   e) Once all threads have been launched, we are done with this sub-kernel.

3. Reorganize the threads.
   a) If all threads are in the barrier queue, move them back into their
      corresponding queues.
   b) If there is at least one thread left in at least one queue, goto 2.
   c) If all threads are finished, the CTA is done.
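The loop above can be sketched in Python, with per-sub-kernel queues of waiting thread ids, a barrier queue, and the pick-the-fullest-queue policy. `run` stands in for a jitted sub-kernel and returns each thread's fate; warp grouping and jitting are omitted, and all names are illustrative:

```python
from collections import defaultdict

def schedule_cta(thread_ids, run, first_subkernel):
    queues = defaultdict(list)        # sub-kernel id -> waiting threads
    barrier = []                      # threads waiting at a barrier
    queues[first_subkernel] = list(thread_ids)
    while queues:
        # Step 2: pick the queue with the most waiting threads.
        sk = max(queues, key=lambda k: len(queues[k]))
        for tid in queues.pop(sk):    # (grouping into warps omitted)
            fate, target = run(sk, tid)
            if fate == "exit":
                continue              # kill the thread, reclaim state
            elif fate == "barrier":
                barrier.append((tid, target))
            else:                     # moved to another sub-kernel queue
                queues[target].append(tid)
        # Step 3: once every live thread is at the barrier, release them
        # back into their corresponding queues.
        if not queues and barrier:
            for tid, target in barrier:
                queues[target].append(tid)
            barrier.clear()
    return "done"                     # all queues drained: the CTA is done
```

One question this sketch raises for the 5% budget: `max` over the queues is linear in the number of sub-kernels per pick, so with many sub-kernels a priority structure may be needed instead.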
Thanks in advance for any replies. The nice thing about working on open-source software is that I can ask for other people's opinions before I start working :)