Thread Scheduling in Ocelot: Some Musings

So, one year after implementing a scheme for scheduling CUDA threads onto CPU hardware threads, I have an opportunity to go back and redesign it from scratch. I thought I would see if anyone here would be interested in commenting on what I have come up with so far.

The main criterion here is speed. Every other concern, such as scheduling and state management, is a means to get more speed.

These are the assumptions that I make. Each complete kernel is broken up into N sub-kernels. Each sub-kernel begins at either a program entry point, a barrier, or an arbitrary point along a path. Each sub-kernel ends at either a program exit point, a function call, or after a path of a specific length. Code is inserted into each sub-kernel so that every path that exits through a point other than the main exit point saves the id of the sub-kernel to resume execution at and, if that sub-kernel has multiple entry points, the id of the correct entry point. Each sub-kernel with more than one entry point is augmented with a scheduler block that performs an indirect branch to the specified entry point based on an input value. Upon exiting a sub-kernel, thread-local state is saved on a stack; upon entering one, it is restored.
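To make that structure concrete, here is a minimal C++ sketch of the control skeleton I imagine the JIT emitting for one sub-kernel. All of the names here (ThreadContext, ExitCode, the entry labels) are illustrative, not Ocelot's actual interfaces.

```cpp
// Hypothetical sketch of the code generated for one sub-kernel.
enum ExitCode { Exited, Barrier, Divergence };

struct ThreadContext {
    int resumeSubKernel = 0;  // sub-kernel id to resume execution at
    int resumeEntry     = 0;  // entry point within that sub-kernel
    // ... plus saved registers, local memory, etc.
};

ExitCode runSubKernel(ThreadContext& ctx) {
    // Scheduler block: a sub-kernel with more than one entry point
    // starts with an indirect branch on the saved entry-point id.
    switch (ctx.resumeEntry) {
        case 0: goto fromProgramEntry;
        case 1: goto fromBarrier;
    }

fromProgramEntry:
    // ... instructions from the original program entry point ...

fromBarrier:
    // ... instructions starting just past a barrier ...

    // Every path that leaves through something other than the main
    // exit point records where this thread should resume; the caller
    // then pushes ctx onto the thread-local state stack.
    ctx.resumeSubKernel = 2;  // illustrative ids
    ctx.resumeEntry     = 1;
    return Barrier;
}
```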

Two versions of each sub-kernel are generated: a serial version and a fused version. The serial version evaluates one thread at a time, while the fused version evaluates N threads at a time executing in lock-step. The fused version is generated by widening every scalar instruction into an N-wide vector instruction and then performing common sub-expression elimination to issue scalar instructions for values that are shared among all threads, avoiding recomputing the same values. For every divergent branch, a vector comparison instruction is inserted; if the branch condition does not evaluate uniformly across all threads, all threads jump out of the fused kernel and save their state. Fused kernels may be implemented with vector (SSE, AVX) instructions or with a series of scalar instructions; the assumption is that even the scalar form will give a performance advantage, because instructions from different threads cannot have data dependencies on each other. The limitation is that fusing more threads together increases register pressure, so the optimal value of N will vary from kernel to kernel. Assume that it is chosen heuristically to try to balance these constraints.
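As an illustration, here is a sketch of the serial and fused forms of a single statement at a fusion width of N = 4, using plain scalar C++ to stand in for what could become SSE/AVX instructions. The function names and the uniformity check are mine, not generated code.

```cpp
// Serial vs. fused forms of:  r = a * b + c;  if (r > 0) goto taken;
const int N = 4;  // illustrative fusion width

// Serial version: one thread at a time.
float serial(float a, float b, float c) {
    return a * b + c;
}

// Fused version: each scalar instruction becomes an N-wide operation.
// A value that CSE proves is uniform across threads (here, c) stays
// scalar instead of being recomputed per thread.
bool fused(const float a[N], const float b[N], float c, float r[N]) {
    for (int i = 0; i < N; ++i)
        r[i] = a[i] * b[i] + c;

    // Divergent branch: evaluate the condition across all threads and
    // check that it is uniform. If the threads disagree, bail out of
    // the fused kernel so each thread can save its state.
    bool cond0 = r[0] > 0.0f;
    for (int i = 1; i < N; ++i)
        if ((r[i] > 0.0f) != cond0)
            return false;  // divergence: exit and save state
    return true;           // uniform: all threads take the same path
}
```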

I assume that the ability to do this already exists.

The problem is now how to manage state for all of the threads and determine their scheduling order. The goal is to spend less than 5% of total execution time in the scheduler while still getting close to an optimal schedule, where the optimal schedule is defined as the one that produces the shortest total execution time.

Here is what I have so far:

  1. During sub-kernel formation and kernel fusion, make sure that the number of instructions along every path from the entry point to any exit point is around 20x the worst-case number of instructions executed in the scheduler; at that 20:1 ratio, scheduler overhead is about 1/21 of total time, just under the 5% target. Unroll and split loops if necessary to accomplish this.

  2. Preallocate enough state for the maximum number of threads per CTA and share it across CTA invocations. This happens outside the scheduler loop.

    Always issue threads with the widest vector width available.

    1. Start by launching threads in order. If they finish before
      exiting the sub-kernel, they die here and their state is
      reclaimed. If they bail out due to divergence or hit a
      context-switch point, allocate a new thread context for the next
      thread and save the current thread context's id for scheduling
      in the queue for the next sub-kernel.
    2. Sort the queues by thread count and pick the queue with the
      most waiting threads (a sketch of this loop is given after the
      list).
      a) If its sub-kernel has not been jitted yet, do so now.
      b) Group threads together into warps of the fused kernel width
      (possibly by sorting, but FCFS should require less overhead),
      and launch them all.
      c) If a thread exits, kill it and reclaim its state; otherwise
      move it to the queue of the sub-kernel it resumes at.
      d) If a thread hits a barrier, put it into a barrier queue.
      e) Once all threads have been launched, we are done with this
      sub-kernel.
    3. Reorganize the threads.
      a) If all threads are in the barrier queue, move them back into
      their corresponding queues.
      b) If there is at least one thread left in at least one queue,
      go to step 2.
      c) If all threads are finished, the CTA is done.
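To tie the steps together, here is a rough sketch of the per-CTA scheduler loop, reusing the ThreadContext/ExitCode/runSubKernel sketch from earlier. The queue layout and all names are hypothetical, not Ocelot's existing code.

```cpp
// Sketch of the per-CTA scheduler loop (steps 1-3 above).
// Assumes the ExitCode / ThreadContext / runSubKernel sketch earlier.
#include <algorithm>
#include <deque>
#include <map>
#include <vector>

void runCta(std::vector<ThreadContext>& threads) {
    // One queue of waiting threads per destination sub-kernel, plus a
    // queue for threads parked at a barrier.
    std::map<int, std::deque<ThreadContext*>> queues;
    std::deque<ThreadContext*> barrier;
    size_t live = threads.size();

    // Step 1: launch threads in order. Threads that reach the kernel
    // exit die here; the rest queue on the sub-kernel they resume at.
    for (auto& t : threads) {
        switch (runSubKernel(t)) {
            case Exited:     --live;                                  break;
            case Barrier:    barrier.push_back(&t);                   break;
            case Divergence: queues[t.resumeSubKernel].push_back(&t); break;
        }
    }

    while (live > 0) {
        // Step 3a: every live thread is at the barrier; release them
        // back into their corresponding per-sub-kernel queues.
        if (queues.empty()) {
            for (auto* t : barrier) queues[t->resumeSubKernel].push_back(t);
            barrier.clear();
            continue;
        }

        // Step 2: pick the sub-kernel with the most waiting threads.
        auto best = std::max_element(queues.begin(), queues.end(),
            [](const auto& a, const auto& b)
                { return a.second.size() < b.second.size(); });
        std::deque<ThreadContext*> ready = std::move(best->second);
        queues.erase(best);

        // 2a: JIT this sub-kernel here, on first use (not shown).
        // 2b-2e: launch the ready threads FCFS (warping by the fused
        // width is elided), reclaiming exited threads and re-queueing
        // the rest.
        for (auto* t : ready) {
            switch (runSubKernel(*t)) {
                case Exited:     --live;                                   break;
                case Barrier:    barrier.push_back(t);                     break;
                case Divergence: queues[t->resumeSubKernel].push_back(t);  break;
            }
        }
    }  // Step 3c: all threads finished; the CTA is done.
}
```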

Thanks in advance for any replies. The nice thing about working on open-source software is that I can ask other people’s opinions before I start working :)
