Running PTX Code from CUDA 4.0 in CUDA 4.1 or CUDA 4.2

Hello,

I have compiled all my CUDA code in 32-bit and 64-bit using the CUDA compiler version 4.0.

Now a customer reported that he can only install CUDA 4.1 on his notebook. Currently my application is locked to CUDA 4.0, but I opened it up to test the application on CUDA 4.1. On my computer I installed CUDA 4.2, but when I run my application there is an exception saying that the binary version is not suitable for the GPU.

My question now is: do I have to recompile all my code when I switch the CUDA version? If not, what could be the problem? If yes, how do others cope with this problem? Do they deliver a separate application for each CUDA version?

I thought building PTX code was “future safe”, or am I doing something wrong?

Currently I’m loading the CUDA code in a given context, using

CUresult aResult = CUDADriverAPI.cuModuleLoadDataEx(
                       ref InternalHandle,
                       (IntPtr)cudaByteCode_pinned,
                       0,
                       null,
                       null);

Thanks

Martin

What you describe sounds more like a problem with binary incompatibility of the GPU in your system with the GPU machine code stored in the object file, rather than an incompatibility between CUDA versions.

CUDA supports fat binaries, which consist of a collection of one or more precompiled machine code versions plus one or more PTX versions. When the driver loads a fat binary, it first looks for matching binary machine code; if it cannot find any, it looks for appropriate PTX to JIT-compile. Here is an example of a set of NVCC switches that builds a fat binary for sm_13, sm_20, and sm_30 (both machine code and PTX):

-gencode arch=compute_13,"code=sm_13,compute_13" -gencode arch=compute_20,"code=sm_20,compute_20" -gencode arch=compute_30,"code=sm_30,compute_30"

I am not familiar with the CUDA driver API (seven years ago I was the first user of the CUDA runtime and have never looked back), but I notice a CUDA driver API function whose name suggests it is appropriate for loading fat binaries: cuModuleLoadFatBinary().
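For driver API users, it can also help to ask the JIT compiler for its error log when a module load fails, since "binary version not suitable" errors then come with details. Below is a minimal, untested sketch against the C driver API (the C# wrapper in the original post would need the corresponding overload); the function name `loadPtxWithLog` is made up for illustration:

```cuda
#include <cuda.h>
#include <stdio.h>

/* Sketch: load a PTX (or fatbin) image with cuModuleLoadDataEx and
   capture the JIT error log, so a failed load reports why. Assumes a
   CUDA context is already current on the calling thread. */
CUmodule loadPtxWithLog(const void *image)
{
    char errorLog[8192] = "";
    CUjit_option options[2] = {
        CU_JIT_ERROR_LOG_BUFFER,
        CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES
    };
    void *optionValues[2] = {
        errorLog,
        (void *)(size_t)sizeof(errorLog)   /* buffer size passed by value */
    };

    CUmodule module = NULL;
    CUresult result = cuModuleLoadDataEx(&module, image,
                                         2, options, optionValues);
    if (result != CUDA_SUCCESS) {
        fprintf(stderr, "cuModuleLoadDataEx failed (%d): %s\n",
                (int)result, errorLog);
    }
    return module;
}
```

If the image contains only machine code for an architecture the installed GPU cannot execute, the load fails with CUDA_ERROR_NO_BINARY_FOR_GPU rather than falling back to anything.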

[Later:] Here is another thought. When you installed the CUDA 4.2 toolkit on your machine, did you also update the driver package?

Hello njuffa,

thank you for your response and the explanation of the gencode arguments, which I haven’t tried so far. Interestingly, we are currently using a Tesla C870, C1060, C1070, C2050, C2070 and a C2075, and the way I have done it so far worked for all of these cards. As you asked, I also updated the driver (I had uninstalled everything before). I normally compile the code for arch=compute_13, but what I have to check is whether I set the code parameter correctly. Currently, when the JIT runs, it creates a running code module with compute capability 2.0 for my C2075, although the code was compiled for 1.3. Do you think that the former versions of my code only ran by accident?

Thanks
Martin

I am not sure I understand how you are running your code. PTX is a virtual instruction set that allows you to write code that will run on a number of different architectures that do not have binary compatibility. sm_20-based GPUs provide different instructions and different instruction encodings than sm_13-based GPUs, so sm_13 machine code simply cannot run on an sm_20 device.

When you specify compute_13, the compiler generates PTX code limited to those PTX operations supported on compute capability 1.3. When this code is then JITed on a C2070 (which has compute capability 2.0) the machine code generated must be for sm_20, as other machine code will not run on an sm_20 device. The JIT process will map a PTX instruction directly to an equivalent machine instruction if one exists, or to an emulation sequence if no corresponding native instruction exists.

What compute capability is the GPU in the notebook? PTX code generated for compute_13 can be translated by the JIT compiler for compute capability 1.3 and higher, but not for devices of lower compute capability. There are still a lot of compute capability 1.1 GPUs around on notebooks though, so this seems to be the likely problem here. If your code can run without the features provided by higher compute capabilities, the -gencode option described by njuffa allows you to add PTX code suitable for compute capability 1.1 (or even 1.0) in addition to the one you already compile for currently.
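One way to apply this advice at runtime is to ship PTX for more than one virtual architecture and pick an image based on the device's compute capability. A sketch, untested and with hypothetical pointer names (`ptxForSm11`, `ptxForSm13` standing in for images built with `-gencode arch=compute_11,...` and `arch=compute_13,...`); cuDeviceComputeCapability is the CUDA 4.x-era query, later deprecated in favor of cuDeviceGetAttribute:

```cuda
#include <cuda.h>

/* Sketch: choose which embedded PTX image to hand to the driver.
   compute_13 PTX can be JIT-compiled for compute capability 1.3 and
   higher; older notebook GPUs (CC 1.1) need the compute_11 image. */
const void *selectPtxImage(CUdevice device,
                           const void *ptxForSm11,
                           const void *ptxForSm13)
{
    int major = 0, minor = 0;
    cuDeviceComputeCapability(&major, &minor, device);
    if (major > 1 || (major == 1 && minor >= 3))
        return ptxForSm13;
    return ptxForSm11;  /* fallback for CC 1.1/1.2 devices */
}
```

With a fat binary built via the -gencode switches quoted above, the driver performs this selection automatically; an explicit check like this is only needed when loading raw PTX images yourself.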

Hi tera and njuffa,

I have found the problem. Someone had changed our switch to compute capability 2.0. That is why it didn’t work
on the C1060, which has compute capability 1.3. So everything is resolved now. Thanks again for your help.

Martin