The process for assembling .ptx files into .o files for linking into a binary?


I’m looking into the process involved in manually compiling to .ptx files and then generating a binary from them. I’ve looked at the ‘Full CUDA Compilation trajectory’ in the nvcc guide and I’m still a little confused.

I’m familiar with the process of using nvcc to compile straight to object files ready for linking into an application using gcc (I’m on x86_64 Linux):

nvcc32 -o main.cu.o -c main.cu -arch sm_21

gcc412 -o main.cpp.o -c main.cpp

gcc412 -o main main.cpp.o main.cu.o

How would I do this process manually going to ptx first?

nvcc32 -o main.ptx -ptx main.cu -arch sm_21

ptxas -o main.cubin main.ptx -cubin

As far as I understand it, I now have to generate the code in fatbin format, as this is what I link into the binary with the rest of the gcc object files? I’m also a little confused about the file hashing and key generation: I can’t seem to pass this .cubin file to the fatbin tool correctly without them. Listing the nvcc steps produces a lot of commands, and I’m not sure what they all mean or, more importantly, which ones I can leave out if I’m starting from ptx rather than from a .cu file with mixed host/device code.

I’m aware that there are a few options to load the ptx at runtime, but I’d really prefer to compile this into a complete binary as in the first example if possible.



If it’s helpful to other people trying to do something similar, here’s what I have so far from pulling apart the nvcc toolchain. This is more than likely not the supported way of doing this, but it means I can link everything up into one binary, just like nvcc produces.

From the point I got to above, I first create my fatbin file:

fatbin --key="xxxxxxxxxx" --embedded-fatbin=main.fatbin.c "--image=profile=compute_20,file=main.ptx" "--image=profile=sm_21,file=main.cubin"

I then generate a simple register cpp file which looks something like this:

#include <math.h>
#include <cuda_runtime.h>
#include "/apps/Linux64/cuda/cuda-3.2/include/crt/device_runtime.h"
#include "/apps/Linux64/cuda/cuda-3.2/include/crt/host_runtime.h"
#include "main.fatbin.c"

/* Empty host-side stub; its address is what the runtime associates with the device kernel. */
void mainKernel( int *__cuda_0,int *__cuda_1,int __cuda_2) { }

static void __register(void) __attribute__((__constructor__));

/* Runs before main() and registers the kernel with the CUDA runtime. */
static void __register(void)
{
	/* Assumes __cudaFatCubinHandle already holds the handle returned by __cudaRegisterFatBinary. */
	__cudaRegisterFunction(__cudaFatCubinHandle, (const char*)((void ( *)(int *, int *, int))mainKernel),
			(char*)"mainKernel", "mainKernel", -1, (uint3*)0, (uint3*)0, (dim3*)0, (dim3*)0, (int*)0);
}
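
To illustrate why the constructor attribute matters in the register file, here is a minimal sketch of the same registration pattern in plain C++. It does not call the CUDA runtime at all: kernel_table, register_kernels, find_stub and KernelStub are hypothetical stand-ins I made up for illustration. The only point is that a GCC constructor function runs before main(), so the stub is already registered by the time any launch code executes.

```cpp
#include <cassert>
#include <map>
#include <string>

// Signature of the host-side stub, matching the kernel's (int*, int*, int).
typedef void (*KernelStub)(int*, int*, int);

// Hypothetical stand-in for the runtime's kernel table. A function-local
// static is used so the table exists whenever the constructor runs.
static std::map<std::string, KernelStub>& kernel_table() {
    static std::map<std::string, KernelStub> table;
    return table;
}

// Empty host-side stub; only its address and name are of interest.
void mainKernel(int*, int*, int) {}

// GCC runs constructor functions while the binary starts up, before main().
static void register_kernels(void) __attribute__((__constructor__));
static void register_kernels(void) {
    kernel_table()["mainKernel"] = &mainKernel;
}

// Returns the registered stub for `name`, or nullptr if none was registered.
KernelStub find_stub(const char* name) {
    assert(name != nullptr);
    auto it = kernel_table().find(name);
    return it == kernel_table().end() ? nullptr : it->second;
}
```

Calling find_stub("mainKernel") from main() returns the address of the stub without main() ever having called register_kernels itself, which mirrors how the real registration is in place before the first kernel launch.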


Now in my main.cpp file I use the CUDA runtime API to set up my desired kernel launches and simply use the cudaLaunch function.
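
The argument setup for such a launch can be sketched without the toolkit headers. In the CUDA 3.2-era runtime API, each kernel parameter is pushed with cudaSetupArgument(arg, size, offset) after cudaConfigureCall and before cudaLaunch, so the host code needs the size and byte offset of every parameter. Below is a minimal sketch of deriving those values for the (int *, int *, int) signature. KernelArgs, ArgSlot and arg_slot are hypothetical names for illustration, the real runtime calls appear only as comments, and this is an assumption about how the launch could be written rather than the exact code from my main.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the kernel's parameter list (int*, int*, int).
// Laying the parameters out as a struct lets the compiler work out the
// alignment and offsets.
struct KernelArgs {
    int* a;
    int* b;
    int  n;
};

// Size and byte offset of one parameter, i.e. the last two arguments
// one would pass to cudaSetupArgument(arg, size, offset).
struct ArgSlot {
    std::size_t size;
    std::size_t offset;
};

// Returns the slot for parameter `index` (0, 1 or 2).
ArgSlot arg_slot(int index) {
    assert(index >= 0 && index < 3);
    switch (index) {
        case 0:  return { sizeof(int*), offsetof(KernelArgs, a) };
        case 1:  return { sizeof(int*), offsetof(KernelArgs, b) };
        default: return { sizeof(int),  offsetof(KernelArgs, n) };
    }
}

// With the toolkit available, the launch would then look roughly like:
//   cudaConfigureCall(grid, block, 0, 0);
//   cudaSetupArgument(&a, arg_slot(0).size, arg_slot(0).offset);
//   cudaSetupArgument(&b, arg_slot(1).size, arg_slot(1).offset);
//   cudaSetupArgument(&n, arg_slot(2).size, arg_slot(2).offset);
//   cudaLaunch(entry);  // entry identifies the registered kernel
```

Using offsetof on a struct rather than hand-written constants keeps the offsets right if the parameter list changes or the code is built for a different pointer size.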


As before I now compile my main.cpp file into the main.o object file using gcc and compile the register.cpp file into a register.o object file. I can now link the register.o and main.o object files into the binary I was after.

Much of this process is undocumented and may be subject to change, so if anyone can suggest a better way of achieving this, it would be nice to know about it. Either way, I have managed to put together the binary I was after using just ptxas, fatbin and gcc (assuming I have already generated the ptx code), which is quite pleasing.