I have a CUDA 1.1 .so that works just fine on my 9800, but when I switch to my GT630 I get an “Invalid Device Function” error. This seems very odd as I make no change at all in the .so; I don’t even reboot. Both the 9800 and the GT630 run all the CUDA samples (including deviceQuery) just fine. The GT630 is compute capability 2.1, but there is supposed to be backwards compatibility. Right?
Correct, there is backwards compatibility. Newer CUDA versions will work with older GPUs. So CUDA 5.0 works with every CUDA-capable device from the latest Kepler K20 down to the initial G80.
What you are attempting would be forward compatibility, where CUDA 1.1 would have to work with devices that weren’t even around when CUDA 1.1 shipped.
Aren’t there a couple of different pieces here? The CUDA Compute Capability of the code is 1.1. The CUDA Compute Capability of the GT630 is 2.1. Shouldn’t it be able to run 1.1 code?
If I read the CUDA Compute Capability document properly, each higher level is a superset of the previous and should be able to run the older code.
I think njuffa was thinking of driver levels?
Sorry, I understood “I have a CUDA 1.1 so that works just fine on my 9800” to mean “I am running CUDA 1.1”, not as “I have a compute capability 1.1 device”. What CUDA version are you running, by the way, and what drivers are installed?
There is no binary code compatibility between sm_1x (compute capability 1.x) and sm_2x (compute capability 2.x) GPUs. In order to have code compiled for sm_1x run on sm_2x, the code must be compiled such that it includes GPU kernels compiled to PTX, which is a portable intermediate format. This PTX code can then be JIT (just in time) compiled by the compiler inside the driver for an sm_2x target. If a PTX version of the kernel is not present in the executable, the invocation of the kernel will fail.
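If you want to check what a given shared library actually contains, cuobjdump (shipped with the CUDA toolkit) can list the embedded images; the library name below is just a placeholder:

```
# list the native binary (SASS) images, one per compiled architecture
cuobjdump --list-elf libexample.so

# list the PTX images available for JIT compilation
cuobjdump --list-ptx libexample.so
```

If the second command prints nothing, the library carries no PTX and can only run on the architectures reported by the first command.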
If the code in question is code you can build yourself, try building it with an nvcc command line that creates binary code for sm_10 as well as corresponding PTX code that can be JIT compiled for sm_21.
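The exact command got lost above; a standard form of the switch (my reconstruction of what was intended, with placeholder file names) would be:

```
nvcc -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_10,code=compute_10 \
     --shared -Xcompiler -fPIC -o libexample.so example.cu

# code=sm_10      -> native binary for compute capability 1.0 devices
# code=compute_10 -> compute_10 PTX the driver can JIT for sm_21 and later
```

The two -gencode instances are what matters; the --shared and -fPIC pieces are just the usual flags for building a Unix shared object.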
I’m afraid I should have spelled things out more clearly. I wrote “.so” intending to refer to a linkable object file under Unix, but it read like the word “so”.
I think you are saying that even though the code might work, it would still need re-compiling to run. I do not have the source, but the developer has also created compute capability 2.0 and 2.1 versions of the linkable object. All die with the same error. I was hoping the straight comparison to a working card might help to identify the source of the problem. It seems very strange that it would run on one CUDA card and not another.
Sorry, I have no diagnosis for the problem you describe. It may be that the library code was built with different switches than claimed, or that there is a version mismatch somewhere in your software stack.
If code is built as a fat binary, one can ship a single dynamic library for all GPU architectures. This is how CUBLAS, CUFFT, CUSPARSE, etc. are built. A fat binary typically contains binary code for all architectures one wants the code to run on, plus PTX for JITing on a future architecture. To create the fat binary, one simply passes additional instances of the -gencode switch I showed above to the nvcc invocation, one for each architecture that should be supported. So for double-precision code one might currently build for sm_13, sm_20, sm_30, and sm_35.
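As a sketch, such a fat-binary library build might look like the following (file names are placeholders; the final -gencode embeds compute_35 PTX so the driver can JIT the code on architectures newer than sm_35):

```
nvcc --shared -Xcompiler -fPIC \
     -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35 \
     -o libexample.so example.cu
```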
The CUDA driver has an API for loading fat binaries. The CUDA runtime uses that under the hood. At runtime, it determines the compute capability of the GPU currently bound to the CUDA context and tries to find a matching binary inside the fat binary. If no matching binary can be found, it looks for PTX that can be JIT compiled. If it cannot find suitable PTX either, the kernel launch fails.
While the PTX JIT mechanism works just fine, it may not deliver optimal performance since the code compiled for an older architecture could be compiled under restrictions no longer present on the current architecture. Occasionally it happens that JIT compiled PTX code from an older architecture runs faster than natively compiled code for the new architecture, through a fortuitous combination of compiler artifacts, or because the JIT compiler is newer and more highly optimizing.
In general I would recommend adding a new -gencode switch to a library build when support for the next GPU architecture is added.
Finally figured this out, and it had nothing to do with the card. There was a dynamic link to a second CUDA module, and the wrong version of that second module had been left in the target directory. Once matching versions were in place, everything worked beautifully.