PTX Code Transformations

Gregory_Diamos · September 16, 2009, 10:27pm

So I am in the process of writing a PTX-to-PTX and PTX-to-LLVM just-in-time compiler as part of Ocelot ( http://code.google.com/p/gpuocelot/ ) and I am beginning to implement some code transformations that may be able to add additional features to CUDA.

I am wondering if people on these forums would find any of the following useful:

1) We could easily add support for recursive function calls within kernels with very little performance overhead. The CUDA compiler will not currently generate code that uses call instructions recursively. However, we could support this at the PTX level by manually creating a stack in local memory, spilling live registers to it before each call instruction and reloading them afterwards.

2) Allow barriers to work in conditional code. For example, in

if( x )
{
  ....
  __syncthreads();
}
else
{
 ...
 __syncthreads();
}

I don’t believe that this currently works, but I could be wrong.

3) Support an arbitrary number of threads per CTA via thread fusion (multiplexing several CUDA threads onto a single hardware thread). This is necessary when compiling PTX to single-threaded architectures like x86, but it can equally be applied to GPUs.

4) Context switch between kernels. We could also support pausing and resuming a kernel. For example, we might be able to get around the 5-second kernel launch limit on Windows by pausing and resuming every few seconds.

5) Detect memory errors (segfaults, etc.) and fail gracefully (pause kernel execution and report which threads segfaulted).

Also, for anyone with compiler experience, I am currently looking into different algorithms for implementing some of the above. 3) in particular has been explored a bit in the literature, but most approaches seem to have a fairly significant overhead when switching between threads.

tmurray · September 16, 2009, 10:48pm

1) Handy.
2) Handy if there’s not a huge perf hit.
3) I don’t think it works in the general case.
4) The timeout and kernel execution time are not knowable at compile time.
5) Not sure how you’re going to do that on actual hardware…

Regardless, I’m excited. Are you coming to GTC? It’d be good to talk.

Gregory_Diamos · September 16, 2009, 11:25pm

1) This would be the simplest thing to do, but we would probably need the compiler to be able to generate code that does this to make it useful. I think that older versions like 2.1 would do this if you included a no_inline flag on kernels, but as of 2.3 at least it seems to throw an error. It would be nice if the compiler could be modified to generate correct code when given a flag.
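To make the stack idea concrete, here is a minimal sketch in plain C++ standing in for PTX. It is not Ocelot code: the function names are made up for illustration, and a `std::vector` plays the role of the per-thread stack that would live in local memory. Each "call" spills the one live value, and each "return" reloads it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference version: ordinary recursion.
std::uint64_t factorial_recursive(std::uint32_t n) {
    return n <= 1 ? 1 : n * factorial_recursive(n - 1);
}

// Transformed version: recursion replaced by an explicit stack.
// On a GPU the stack would sit in per-thread local memory; the
// std::vector here is only a stand-in (an assumption of this sketch).
std::uint64_t factorial_explicit_stack(std::uint32_t n) {
    std::vector<std::uint32_t> stack;  // hypothetical per-thread spill area

    // Call phase: each iteration corresponds to one recursive call.
    while (n > 1) {
        stack.push_back(n);  // spill the live register before the call
        n = n - 1;           // argument passed to the callee
    }

    std::uint64_t result = 1;  // value returned by the base case

    // Return phase: each iteration corresponds to one return to a caller.
    while (!stack.empty()) {
        std::uint32_t saved = stack.back();  // reload the spilled register
        stack.pop_back();
        result *= saved;
    }
    return result;
}
```

A general transformation would also have to save a return address per frame, since a function can be called from more than one site. This example has a single call site, so the stack only needs to hold the live value.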

2) A simple approach to doing this would be to replace every barrier with a jump into and out of a block that contains a single barrier, so that all threads always hit the same barrier if they hit a barrier at all. The overhead would be one direct branch and one indirect branch per barrier.
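A rough way to picture this is the serialized CPU model below, written in plain C++. It is only a sketch under my own assumptions: the names (`ResumePoint`, `run_until_barrier`, `run_cta`) are invented, threads are run one after another instead of in parallel, and the `switch` stands in for the indirect branch out of the shared barrier block. Both sides of the divergent `if` record where they want to resume and then go to the same barrier.

```cpp
#include <cassert>
#include <vector>

// Resume points: the possible targets of the indirect branch.
enum ResumePoint { kStart, kAfterBarrierThen, kAfterBarrierElse, kDone };

struct ThreadState {
    int tid;
    ResumePoint next;
};

struct Cta {
    std::vector<int> shared;  // stands in for shared memory
    std::vector<int> out;     // one result per thread
};

// Runs one thread until it reaches the barrier or finishes.
// Returns true if the thread is now waiting at the barrier.
bool run_until_barrier(ThreadState& t, Cta& cta) {
    const int n = static_cast<int>(cta.shared.size());
    switch (t.next) {  // indirect branch out of the barrier block
    case kStart:
        if (t.tid % 2 == 0) {            // divergent condition, like if( x )
            cta.shared[t.tid] = t.tid * 10;
            t.next = kAfterBarrierThen;  // remember where to come back to
        } else {
            cta.shared[t.tid] = t.tid + 100;
            t.next = kAfterBarrierElse;
        }
        return true;                     // direct branch into the barrier block
    case kAfterBarrierThen:
        // Reading a neighbour's value is safe only after the barrier.
        cta.out[t.tid] = cta.shared[(t.tid + 1) % n];
        t.next = kDone;
        return false;
    case kAfterBarrierElse:
        cta.out[t.tid] = -cta.shared[(t.tid + 1) % n];
        t.next = kDone;
        return false;
    case kDone:
        return false;
    }
    return false;
}

std::vector<int> run_cta(int num_threads) {
    Cta cta{std::vector<int>(num_threads, 0), std::vector<int>(num_threads, 0)};
    std::vector<ThreadState> threads;
    for (int tid = 0; tid < num_threads; ++tid) threads.push_back({tid, kStart});

    bool any_waiting = true;
    while (any_waiting) {
        any_waiting = false;
        // The single barrier: one pass runs every unfinished thread up to its
        // next barrier, and only the following pass lets them continue.
        for (ThreadState& t : threads) {
            if (t.next != kDone && run_until_barrier(t, cta)) any_waiting = true;
        }
    }
    return cta.out;
}
```

With four threads, the even threads take the first branch and the odd threads the second, yet every thread sees its neighbour's write because all of them passed through the one barrier.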

3) There may be a significant overhead to doing this on a GPU. The basic approach is to add a loop so that each PTX thread executes N CUDA threads. We modify all accesses to local memory to include a stride based on the current thread id, and modify all accesses to special registers (%tid.x, %tid.y, etc.) to add an offset. This requires some effort to support barriers, and it changes the warp size. The advantage would be that it would allow the programmer to specify an arbitrary amount of parallelism within a CTA and then let the compiler fuse threads together to fit the constraints of the hardware.
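The loop structure can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the Ocelot implementation: `kernel_body`, `fused_hw_thread`, `launch_fused` and `kLocalWords` are hypothetical names, the hardware threads are simply run in sequence, and barriers are left out entirely.

```cpp
#include <cassert>
#include <vector>

constexpr int kLocalWords = 2;  // local-memory words per logical thread (made up)

// Body of one logical CUDA thread. `tid` plays the role of %tid.x and
// `local` points at that thread's slice of local memory.
void kernel_body(int tid, int* local, std::vector<int>& out) {
    local[0] = tid * 2;
    local[1] = local[0] + 1;
    out[tid] = local[0] + local[1];  // 4 * tid + 1
}

// One hardware thread executing `fuse` logical threads in a loop.
void fused_hw_thread(int hw_tid, int fuse, std::vector<int>& out) {
    std::vector<int> local(fuse * kLocalWords);  // local memory enlarged by the fusion factor
    for (int i = 0; i < fuse; ++i) {
        int tid = hw_tid * fuse + i;                  // offset added to the special register
        int* slice = local.data() + i * kLocalWords;  // strided local-memory access
        kernel_body(tid, slice, out);
    }
}

// Launches `logical_threads` CUDA threads on `hw_threads` hardware threads.
std::vector<int> launch_fused(int logical_threads, int hw_threads) {
    assert(hw_threads > 0 && logical_threads % hw_threads == 0);
    int fuse = logical_threads / hw_threads;
    std::vector<int> out(logical_threads, 0);
    for (int hw = 0; hw < hw_threads; ++hw) fused_hw_thread(hw, fuse, out);
    return out;
}
```

For example, `launch_fused(8, 2)` runs eight logical threads on two hardware threads and should produce the same output as `launch_fused(8, 8)`, where nothing is fused.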

4) Yes, but we would be modifying the code when it is loaded, just before it is executed on the GPU, so we would know the timeout and whether or not the GPU was subject to a maximum execution time constraint. The approach would be to instrument back edges in the control flow graph with checks against a timer and, if the timer has expired, save live registers for each thread on a stack in global memory as well as the exit point for each thread. To resume the kernel, we would include prologue code in each kernel that would restore live registers and jump to the previous location.
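Here is a small plain C++ model of that pause-and-resume scheme, as a sketch only. The names (`SavedContext`, `sum_kernel`, `run_to_completion`) are invented, and an iteration budget stands in for the real timer, which is an assumption made so the example is deterministic. The check sits on the loop back edge, and the prologue restores the saved values before re-entering the loop.

```cpp
#include <cassert>
#include <cstdint>

// Per-thread state saved when the kernel is paused. On a GPU this would
// live in global memory; the struct layout here is hypothetical.
struct SavedContext {
    bool valid = false;     // true if a paused kernel is waiting to resume
    std::uint64_t i = 0;    // live register: loop counter
    std::uint64_t sum = 0;  // live register: accumulator
};

// Sums 0..n-1, but gives up after `budget` loop iterations.
// Returns true when finished, false when paused.
bool sum_kernel(std::uint64_t n, std::uint64_t budget,
                SavedContext& ctx, std::uint64_t& result) {
    std::uint64_t i = 0, sum = 0;

    // Prologue: restore live registers if we are resuming.
    if (ctx.valid) {
        i = ctx.i;
        sum = ctx.sum;
        ctx.valid = false;
    }

    std::uint64_t ticks = 0;  // stands in for the timer
    while (i < n) {
        sum += i;
        ++i;
        // Instrumented back edge: check the "timer" before looping again.
        if (++ticks >= budget && i < n) {
            ctx.valid = true;  // save live registers and the resume point
            ctx.i = i;
            ctx.sum = sum;
            return false;      // paused
        }
    }
    result = sum;
    return true;
}

// Host-side driver: relaunches the kernel until it reports completion.
std::uint64_t run_to_completion(std::uint64_t n, std::uint64_t budget, int& launches) {
    assert(budget > 0);
    SavedContext ctx;
    std::uint64_t result = 0;
    launches = 0;
    bool done = false;
    while (!done) {
        done = sum_kernel(n, budget, ctx, result);
        ++launches;
    }
    return result;
}
```

This kernel has only one loop, so the resume point is implicit. With several back edges, the saved context would also need an identifier for the exit point, and the prologue would branch on it.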

5) The basic approach would be to copy the memory map of all allocated variables and the size of statically allocated memory regions into a hidden data structure in global memory before launching a kernel. We would then guard all load and store instructions with checks that make sure the address is mapped, and bail out otherwise. This would incur a very significant performance overhead, but might be useful for debugging.

Support via a PTX-visible MMU that would do the checks in hardware would make this much faster.
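The software version of the check could look roughly like the plain C++ sketch below. Everything in it is an assumption for illustration: `Region`, `MemoryMap`, `checked_load` and `checked_store` are made-up names, and a linear scan over a `std::vector` stands in for the hidden table in global memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One allocated region: base address and size in bytes.
struct Region {
    std::uintptr_t base;
    std::size_t size;
};

// Stand-in for the hidden table copied to the device before launch.
struct MemoryMap {
    std::vector<Region> regions;

    void add(const void* p, std::size_t bytes) {
        regions.push_back({reinterpret_cast<std::uintptr_t>(p), bytes});
    }

    // True if [p, p + bytes) lies entirely inside one recorded region.
    bool is_mapped(const void* p, std::size_t bytes) const {
        const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
        for (const Region& r : regions) {
            if (addr >= r.base && bytes <= r.size &&
                addr - r.base <= r.size - bytes) {
                return true;
            }
        }
        return false;
    }
};

// Guarded load: reports failure instead of touching unmapped memory.
bool checked_load(const MemoryMap& map, const int* addr, int& value) {
    if (!map.is_mapped(addr, sizeof(int))) return false;  // bail out
    value = *addr;
    return true;
}

// Guarded store, with the same check in front of the write.
bool checked_store(const MemoryMap& map, int* addr, int value) {
    if (!map.is_mapped(addr, sizeof(int))) return false;  // bail out
    *addr = value;
    return true;
}
```

The linear scan makes the cost per access proportional to the number of allocations, which is one reason the overhead would be so high. A sorted table with binary search would reduce that, at the price of more complicated instrumented code.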

I have a poster and a presentation at the research summit. I’ll keep an eye out for you there.
