PTX Code Transformations

I am writing a PTX-to-PTX and PTX-to-LLVM just-in-time compiler as part of Ocelot, and I am beginning to implement some code transformations that could add features to CUDA.

I am wondering if people on these forums would find any of the following useful:

  1. We could easily add support for recursive function calls within kernels with very little performance overhead. The CUDA compiler does not currently generate code that uses call instructions recursively; however, we could support this at the PTX level by manually creating a stack in local memory and spilling and reloading live registers around call instructions.

  2. Allow barriers to work in conditional code. For example, in

if( x )

I don’t believe that this currently works, but I could be wrong.

  3. Support an arbitrary number of threads per CTA via thread fusion (multiplexing several CUDA threads onto a single hardware thread). This is necessary when compiling PTX to single-threaded architectures like x86, but it can equally be applied to GPUs.

  4. Context switch between kernels. We could also support pausing and resuming a kernel. For example, we might be able to get around the 5-second kernel launch limit on Windows by pausing and resuming every few seconds.

  5. Detect memory errors (segfaults, etc.) and fail gracefully (pause kernel execution and report which threads segfaulted).

Also, for anyone with compiler experience, I am currently looking into different algorithms for implementing some of the above. 3) in particular has been explored a bit in the literature, but it seems like most approaches incur a fairly significant overhead when switching between threads.

  1. Handy.
  2. Handy if there’s not a huge perf hit.
  3. I don’t think it works in the general case.
  4. The timeout and kernel execution time are not knowable at compile time.
  5. Not sure how you’re going to do that on actual hardware…

Regardless, I’m excited. Are you coming to GTC? It’d be good to talk.

This would be the simplest thing to do, but we would probably need the compiler to be able to generate code that does this to make it useful. I think that older versions like 2.1 would do this if you included a no_inline flag on kernels, but as of 2.3 at least it seems to throw an error. It would be nice if the compiler could be modified to generate correct code if you give it a flag.
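As a rough illustration of the explicit-stack idea from item 1, here is a host-side C++ sketch rather than real PTX. A recursive Fibonacci is rewritten so that each "call" spills its live values into a frame and each "return" reloads them. The names `Frame`, `ResumePoint`, and `fib_with_manual_stack` are made up for this example, and the `std::vector` merely stands in for a per-thread stack in local memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Which call site a frame should resume at when its callee returns.
enum class ResumePoint { AfterFirstCall, AfterSecondCall };

// Hypothetical frame layout: the values that are live across a call.
struct Frame {
    std::uint32_t n;        // argument of the suspended invocation
    std::uint32_t partial;  // result of fib(n - 1), live across the second call
    ResumePoint resume;     // stands in for a saved return address
};

// Emulates fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2) without recursion.
std::uint32_t fib_with_manual_stack(std::uint32_t n) {
    assert(n <= 47);              // fib(48) would overflow 32 bits
    std::vector<Frame> stack;     // stands in for a stack in local memory
    std::uint32_t arg = n;        // "argument register"
    std::uint32_t ret = 0;        // "return value register"
    bool calling = true;          // true: entering a call, false: returning
    for (;;) {
        if (calling) {
            if (arg < 2) {        // base case: return immediately
                ret = arg;
                calling = false;
            } else {              // spill live state, then call fib(arg - 1)
                stack.push_back(Frame{arg, 0, ResumePoint::AfterFirstCall});
                arg = arg - 1;
            }
        } else {
            if (stack.empty()) {
                return ret;       // the outermost call has returned
            }
            Frame f = stack.back();   // reload the caller's live state
            stack.pop_back();
            if (f.resume == ResumePoint::AfterFirstCall) {
                // Keep the first result and call fib(n - 2).
                stack.push_back(Frame{f.n, ret, ResumePoint::AfterSecondCall});
                arg = f.n - 2;
                calling = true;
            } else {
                ret = f.partial + ret;  // both results available: return the sum
            }
        }
    }
}
```

In a real PTX transformation the frame contents would come from liveness analysis at each call site, and the stack depth would be bounded by the amount of local memory available per thread.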

A simple approach to doing this would be to replace all barriers with a jump into and out of a block that contains a single barrier, so that all threads would always hit the same barrier if they hit a barrier at all. The overhead would be one direct branch and one indirect branch per barrier.
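To make the control flow concrete, here is a small host-side C++ emulation of that funnelling scheme, not actual PTX. Each emulated thread is a little state machine: a barrier in either branch becomes "record a return label, branch to the one shared barrier block", and leaving the barrier is an indirect branch through that label. The labels, the `Thread` struct, and `run_cta` are all hypothetical, and threads are stepped sequentially rather than in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Label { Entry, EvenBefore, OddBefore, Barrier, EvenAfter, OddAfter, Done };

struct Thread {
    std::size_t tid = 0;
    Label pc = Label::Entry;
    Label return_label = Label::Done;  // target of the indirect branch out of the barrier
};

// Runs one thread until it parks at the shared barrier block or finishes.
void run_until_barrier_or_done(Thread& t, std::vector<int>& shared, std::vector<int>& out) {
    for (;;) {
        switch (t.pc) {
        case Label::Entry:       // the divergent condition: even vs. odd thread id
            t.pc = (t.tid % 2 == 0) ? Label::EvenBefore : Label::OddBefore;
            break;
        case Label::EvenBefore:  // "if" side, up to its barrier
            shared[t.tid] = static_cast<int>(t.tid) * 10;
            t.return_label = Label::EvenAfter;
            t.pc = Label::Barrier;   // direct branch into the barrier block
            break;
        case Label::OddBefore:   // "else" side, up to its barrier
            shared[t.tid] = static_cast<int>(t.tid) * 100;
            t.return_label = Label::OddAfter;
            t.pc = Label::Barrier;   // same barrier block as the other side
            break;
        case Label::Barrier:
            return;                  // parked until every thread has arrived
        case Label::EvenAfter:   // reads the value written by the odd neighbour
            out[t.tid] = shared[t.tid + 1];
            t.pc = Label::Done;
            break;
        case Label::OddAfter:    // reads the value written by the even neighbour
            out[t.tid] = shared[t.tid - 1];
            t.pc = Label::Done;
            break;
        case Label::Done:
            return;
        }
    }
}

std::vector<int> run_cta(std::size_t num_threads) {
    assert(num_threads % 2 == 0);  // each thread needs a neighbour
    std::vector<Thread> threads(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i) threads[i].tid = i;
    std::vector<int> shared(num_threads, 0), out(num_threads, 0);
    bool released_any = true;
    while (released_any) {
        released_any = false;
        for (Thread& t : threads) run_until_barrier_or_done(t, shared, out);
        for (Thread& t : threads) {
            if (t.pc == Label::Barrier) {   // everyone has arrived: take the
                t.pc = t.return_label;      // indirect branch back out
                released_any = true;
            }
        }
    }
    return out;
}
```

This sketch assumes that every thread which reaches a barrier is matched by all other threads reaching one too. It does not address what should happen if some threads exit without ever hitting a barrier.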

Doing this on a GPU may carry a significant overhead. The basic approach is to add a loop to each PTX thread that executes N CUDA threads. We modify all accesses to local memory to include a stride based on the current thread id, and modify all accesses to special registers (%tid.x, %tid.y, etc.) to add an offset. This requires some effort to support barriers, and it changes the warp size. The advantage is that it would allow the programmer to specify an arbitrary amount of parallelism within a CTA and then let the compiler fuse threads together to fit the constraints of the hardware.
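A minimal host-side C++ sketch of the fusion loop, assuming a kernel without barriers: `kernel_body` plays the role of the original per-thread code, and `fused_hardware_thread` wraps it in a loop that offsets the thread id and strides the local-memory accesses. All names and the two-word local memory size are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLocalWordsPerThread = 2;  // hypothetical local memory per CUDA thread

// Original per-thread body: out[tid] = tid * tid + 1, staged through local memory.
void kernel_body(std::size_t tid, int* local, std::vector<int>& out) {
    local[0] = static_cast<int>(tid * tid);
    local[1] = local[0] + 1;
    out[tid] = local[1];
}

// One hardware thread executing `fusion_factor` logical CUDA threads in a loop.
void fused_hardware_thread(std::size_t hw_tid, std::size_t fusion_factor,
                           std::vector<int>& local_memory, std::vector<int>& out) {
    for (std::size_t i = 0; i < fusion_factor; ++i) {
        // Offset applied to every read of the thread-id special register.
        std::size_t logical_tid = hw_tid * fusion_factor + i;
        // Stride applied to every local-memory access.
        int* local = &local_memory[i * kLocalWordsPerThread];
        kernel_body(logical_tid, local, out);
    }
}

// Runs `logical_threads` CUDA threads on `hw_threads` hardware threads.
std::vector<int> run_fused_cta(std::size_t logical_threads, std::size_t hw_threads) {
    assert(hw_threads > 0 && logical_threads % hw_threads == 0);
    std::size_t fusion_factor = logical_threads / hw_threads;
    std::vector<int> out(logical_threads, 0);
    for (std::size_t hw = 0; hw < hw_threads; ++hw) {
        // Local memory private to this hardware thread, enlarged to hold
        // one slice per fused logical thread.
        std::vector<int> local_memory(fusion_factor * kLocalWordsPerThread, 0);
        fused_hardware_thread(hw, fusion_factor, local_memory, out);
    }
    return out;
}
```

With a barrier in the body, the single loop would have to be split at each barrier so that all fused threads finish the code before the barrier before any of them starts the code after it. That is where the extra effort mentioned above comes in.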

Yes, but we would be modifying the code when it is loaded, just before it is executed on the GPU, so we would know the timeout and whether or not the GPU was subject to a maximum execution time constraint. The approach would be to instrument back edges in the control flow graph with checks against a timer and, if the timer has expired, save live registers for each thread on a stack in global memory as well as the exit point for each thread. To resume the kernel, we would include prologue code in each kernel that would restore live registers and jump to the previous location.
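Here is a hedged host-side C++ sketch of that instrumentation for a single thread running a single loop. A count of back edges stands in for the timer so that the example is deterministic, and `SavedContext`, `sum_kernel`, and `run_to_completion` are made-up names. Because there is only one back edge, the saved exit point is implicit; a real transformation would also record which back edge the thread left through.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical saved state for one thread: its live registers at the exit point.
struct SavedContext {
    bool valid = false;     // false on a fresh launch, true when resuming
    std::uint32_t i = 0;    // live register: loop counter
    std::uint64_t sum = 0;  // live register: accumulator
};

// Sums 0..n-1. Returns true when finished, false when it paused itself
// after `budget` back edges (the stand-in for an expired timer).
bool sum_kernel(std::uint32_t n, std::uint32_t budget,
                SavedContext& ctx, std::uint64_t& result) {
    std::uint32_t i = 0;
    std::uint64_t sum = 0;
    // Prologue: restore live registers and continue where the thread left off.
    if (ctx.valid) {
        i = ctx.i;
        sum = ctx.sum;
        ctx.valid = false;
    }
    std::uint32_t back_edges_taken = 0;
    while (i < n) {
        sum += i;
        ++i;
        // Instrumented back edge: check the budget before looping again.
        ++back_edges_taken;
        if (back_edges_taken >= budget && i < n) {
            ctx.valid = true;   // spill live registers and exit the kernel
            ctx.i = i;
            ctx.sum = sum;
            return false;
        }
    }
    result = sum;
    return true;
}

// Host-side driver: relaunch until the kernel reports completion.
std::uint64_t run_to_completion(std::uint32_t n, std::uint32_t budget,
                                std::uint32_t& launches) {
    assert(budget > 0);  // a zero budget could never make progress
    SavedContext ctx;
    std::uint64_t result = 0;
    launches = 0;
    bool done = false;
    while (!done) {
        done = sum_kernel(n, budget, ctx, result);
        ++launches;
    }
    return result;
}
```

For example, `run_to_completion(10, 4, launches)` pauses twice and finishes on the third launch with the same result as an uninterrupted run.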

The basic approach would be to copy the memory map of all allocated variables and the size of statically allocated memory regions into a hidden data structure in global memory before launching a kernel. We would then guard all load and store instructions with checks that make sure the address is mapped and bail out otherwise. This would incur a very significant performance overhead, but might be useful for debugging.
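A small host-side C++ sketch of such a guard, assuming the memory map is just a flat list of base/size pairs that is searched linearly. `Region`, `MemoryMap`, `Fault`, and `checked_load` are illustrative names, not part of any existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One allocated region: [base, base + size).
struct Region {
    std::uintptr_t base;
    std::size_t size;
};

// Hypothetical hidden data structure copied in before the kernel launches.
struct MemoryMap {
    std::vector<Region> regions;

    // True if the whole access [address, address + bytes) lies inside one region.
    bool is_mapped(std::uintptr_t address, std::size_t bytes) const {
        assert(bytes > 0);
        for (const Region& r : regions) {
            if (address >= r.base && bytes <= r.size &&
                address - r.base <= r.size - bytes) {
                return true;
            }
        }
        return false;
    }
};

// Record of the first faulting access, for reporting back to the host.
struct Fault {
    bool faulted = false;
    std::size_t thread_id = 0;
    std::uintptr_t address = 0;
};

// Guarded load: performs the load only if the address is mapped,
// otherwise records which thread faulted and bails out.
bool checked_load(const MemoryMap& map, std::size_t thread_id,
                  const int* p, int& value, Fault& fault) {
    std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    if (!map.is_mapped(address, sizeof(int))) {
        fault.faulted = true;
        fault.thread_id = thread_id;
        fault.address = address;
        return false;
    }
    value = *p;
    return true;
}
```

A store guard would look the same with the assignment reversed. The linear search per access is the kind of overhead that makes this mostly a debugging aid.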

Support via a PTX-visible MMU that would do the checks in hardware would make this much faster.

I have a poster and a presentation at the research summit. I’ll keep an eye out for you there.