Running PTX Code from CUDA 4.0 in CUDA 4.1 or CUDA 4.2

Hello,

I have compiled all my CUDA code in 32-bit and 64-bit using the CUDA compiler version 4.0.

Now a customer reported that he can only install CUDA 4.1 on his notebook. Currently my application is locked to CUDA 4.0, but I opened it up to test the application on CUDA 4.1. On my computer I installed CUDA 4.2, but when I run my application there is an exception saying that the binary version is not suitable for the GPU.

My question now is: do I have to recompile all my code when I switch the CUDA version? If not, what could be the problem? If yes, how do others cope with this problem? Do they deliver a separate application for each CUDA version?

I thought building PTX code was “future safe”, or am I doing something wrong?

Currently I’m loading the CUDA code in a given context, using

CUresult aResult = CUDADriverAPI.cuModuleLoadDataEx(
                       ref InternalHandle,
                       (IntPtr)cudaByteCode_pinned,
                       0,
                       null,
                       null);

Thanks

Martin

What you describe sounds more like a problem with binary incompatibility of the GPU in your system with the GPU machine code stored in the object file, rather than an incompatibility between CUDA versions.

CUDA supports fat binaries, which consist of a collection of one or more precompiled machine code versions plus one or more PTX versions. When the driver loads a fat binary, it first looks for matching binary machine code; if it cannot find any, it looks for appropriate PTX to JIT-compile. Here is an example of a set of NVCC switches that builds a fat binary for sm_13, sm_20, and sm_30 (both machine code and PTX):

-gencode arch=compute_13,"code=sm_13,compute_13" -gencode arch=compute_20,"code=sm_20,compute_20" -gencode arch=compute_30,"code=sm_30,compute_30"

I am not familiar with the CUDA driver API (seven years ago I was the first user of the CUDA runtime and have never looked back), but I notice a CUDA driver API function whose name suggests it is appropriate for loading fat binaries: cuModuleLoadFatBinary().
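For driver API users, it can also help to ask the JIT compiler for its error log when a module load fails, since "binary version not suitable" errors then come with details. Below is a minimal, untested sketch against the C driver API (the C# wrapper in the original post would need the corresponding overload); the function name `loadPtxWithLog` is made up for illustration:

```cuda
#include <cuda.h>
#include <stdio.h>

/* Sketch: load a PTX (or fatbin) image with cuModuleLoadDataEx and
   capture the JIT error log, so a failed load reports why. Assumes a
   CUDA context is already current on the calling thread. */
CUmodule loadPtxWithLog(const void *image)
{
    char errorLog[8192] = "";
    CUjit_option options[2] = {
        CU_JIT_ERROR_LOG_BUFFER,
        CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES
    };
    void *optionValues[2] = {
        errorLog,
        (void *)(size_t)sizeof(errorLog)   /* buffer size passed by value */
    };

    CUmodule module = NULL;
    CUresult result = cuModuleLoadDataEx(&module, image,
                                         2, options, optionValues);
    if (result != CUDA_SUCCESS) {
        fprintf(stderr, "cuModuleLoadDataEx failed (%d): %s\n",
                (int)result, errorLog);
    }
    return module;
}
```

If the image contains only machine code for an architecture the installed GPU cannot execute, the load fails with CUDA_ERROR_NO_BINARY_FOR_GPU rather than falling back to anything.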

[Later:] Here is another thought. When you installed the CUDA 4.2 toolkit on your machine, did you also update the driver package?

Hello njuffa,

thank you for your response and the explanation of the gencode arguments, which I haven’t tried so far. Interestingly, we are currently using a Tesla C870, C1060, C1070, C2050, C2070 and a C2075, and the way I have done it so far worked for all of these cards. As you asked, I also updated the driver (I had uninstalled everything before). I normally compile the code for arch=compute_13, but what I have to check is whether I set the code parameter correctly. Currently, when the JIT runs, it creates a running code module with compute capability 2.0 for my C2075, although the code was compiled for 1.3. Do you think that the former versions of my code only ran by accident?

Thanks
Martin

I am not sure I understand how you are running your code. PTX is a virtual instruction set that allows you to write code that will run on a number of different architectures that do not have binary compatibility. sm_20-based GPUs provide different instructions and different instruction encodings than sm_13-based GPUs, so sm_13 machine code simply cannot run on an sm_20 device.

When you specify compute_13, the compiler generates PTX code limited to those PTX operations supported on compute capability 1.3. When this code is then JITed on a C2070 (which has compute capability 2.0) the machine code generated must be for sm_20, as other machine code will not run on an sm_20 device. The JIT process will map a PTX instruction directly to an equivalent machine instruction if one exists, or to an emulation sequence if no corresponding native instruction exists.

What compute capability is the GPU in the notebook? PTX code generated for compute_13 can be translated by the JIT compiler for compute capability 1.3 and higher, but not for devices of lower compute capability. There are still a lot of compute capability 1.1 GPUs around on notebooks though, so this seems to be the likely problem here. If your code can run without the features provided by higher compute capabilities, the -gencode option described by njuffa allows you to add PTX code suitable for compute capability 1.1 (or even 1.0) in addition to the one you already compile for currently.
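One way to apply this advice at runtime is to ship PTX for more than one virtual architecture and pick an image based on the device's compute capability. A sketch, untested and with hypothetical pointer names (`ptxForSm11`, `ptxForSm13` standing in for images built with `-gencode arch=compute_11,...` and `arch=compute_13,...`); cuDeviceComputeCapability is the CUDA 4.x-era query, later deprecated in favor of cuDeviceGetAttribute:

```cuda
#include <cuda.h>

/* Sketch: choose which embedded PTX image to hand to the driver.
   compute_13 PTX can be JIT-compiled for compute capability 1.3 and
   higher; older notebook GPUs (CC 1.1) need the compute_11 image. */
const void *selectPtxImage(CUdevice device,
                           const void *ptxForSm11,
                           const void *ptxForSm13)
{
    int major = 0, minor = 0;
    cuDeviceComputeCapability(&major, &minor, device);
    if (major > 1 || (major == 1 && minor >= 3))
        return ptxForSm13;
    return ptxForSm11;  /* fallback for CC 1.1/1.2 devices */
}
```

With a fat binary built via the -gencode switches quoted above, the driver performs this selection automatically; an explicit check like this is only needed when loading raw PTX images yourself.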

Hi tera and njuffa,

I have found the problem. Someone had changed our switch to compute capability 2.0. That is why it didn’t work
on the C1060, which has compute capability 1.3. So everything is resolved now. Thanks again for your help.

Martin