CUDA 12/13 `-arch` flag no longer produces "universal" binaries

I have been using `nvcc -arch=sm_xx` to build my executable (GitHub: fangq/mcx - Monte Carlo eXtreme (MCX), a physically accurate and validated GPU ray-tracer) for releases/deployment for the past 15 years. When setting sm_xx to the lowest CC that the installed nvcc supports (such as sm_30 for CUDA 10), the resulting binary is both backward compatible with all previous-generation GPUs (as old as sm_xx) and forward compatible with all newer, even not-yet-released, GPUs (is that called a fat binary?). This makes it very convenient to package the binary for deployment, knowing that it can run across a large percentage of NVIDIA GPUs (and is future-proof).

Starting from CUDA 12.x, using a single -arch=sm_xx flag to produce "universal" binaries no longer works. For example, when compiling my program with nvcc -arch=sm_75 under CUDA 12.x, the output binary cannot run on the 4090 GPU in the very system where that CUDA was installed; it gives this error instead:

the provided PTX was compiled with an unsupported toolchain.

Asking an AI chatbot about this, the suggested workaround is to attach an -arch/-gencode entry for every CC that the program intends to support (and that the installed nvcc supports). But this is a significantly inferior solution compared to the old behavior, and it fails in two ways (a sketch of what this looks like follows the list):

  1. it loses the forward compatibility that allows the binary to run on GPUs newer than the highest specified arch, including future GPUs yet to be released
  2. it makes the command non-scalable and non-portable, with long and verbose flags, making the build difficult to maintain
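
For concreteness, the per-CC workaround looks roughly like the following (just a sketch of the flag pattern, not my actual Makefile; the CC list, output name, and source file are placeholders): one -gencode clause per real architecture for SASS, plus a final compute_xx entry so PTX for the newest architecture is still embedded as a forward-compatibility fallback.

nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_89,code=sm_89 \
     -gencode arch=compute_89,code=compute_89 \
     -o mcx mcx_core.cu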

Perhaps the AI chatbot does not know everything (I hope). I am wondering if NVIDIA developers can suggest a better workaround that restores the compiled binary's backward and forward compatibility with all NVIDIA GPUs, especially future generations.

You can reproduce the above error using the following commands (on a GPU newer than sm_75) with CUDA 12.x or 13:

git clone git@github.com:fangq/mcx.git

cd mcx/src
make
../bin/mcx --bench cube60

The same commands work fine with CUDA 11 or any older CUDA version, regardless of your GPU's CC level.

Perhaps -arch=all is what you are looking for.
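
As a rough sketch (assuming a CUDA 11.5 or newer toolkit, where these arch shortcuts are accepted; the output and source names are placeholders):

nvcc -arch=all -o mcx mcx_core.cu          # fatbin covering all architectures this toolkit supports
nvcc -arch=all-major -o mcx mcx_core.cu    # smaller variant: each major architecture only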

The downside to using your previous method, "setting the sm_xx to be the lowest CC (such as sm_30 for cuda 10) that the installed nvcc supports", is that the PTX included in the binary is also for sm_30. This means that when the binary is executed on a later card with newer capabilities/instructions, e.g. tensor cores, those features go unused, because sm_30 PTX has no knowledge of them when the driver JIT-compiles it into code for that later card.

I don’t believe there has been a change in the specified behavior.

The solution is and always has been to update the driver on that machine. Or as already suggested, in some cases it may be sufficient to build an sm-specific binary for any/all arches needed (although with consumer cards, this will generally still require a CUDA version reported by the driver to be equal or greater than the CUDA version used to compile the code - i.e. the CUDA “Runtime” version). The CUDA samples generally demonstrate this in their makefiles, in addition to the suggested arch switch.

It has always been the case, that if you compile PTX with a compiler version that is newer (i.e. a compiler that generates a PTX version that is newer) than what is recognized by a specific driver version, then the PTX will not be JIT-compiled. This is not new in any way for CUDA 12 or any other. The reason CUDA 11 appears to work is that as you move backwards in CUDA versions, eventually your compiler becomes old enough that the PTX version generated is compatible with whatever driver you happen to have on that machine.
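
To see this concretely, one sketch (adjust the binary path to your build; cuobjdump ships with the toolkit) is to dump the PTX embedded in the fat binary and look at its .version line, then compare that against the driver on the target machine:

cuobjdump --dump-ptx ../bin/mcx | grep -m1 ".version"     # PTX ISA version baked in at compile time
nvidia-smi | grep "CUDA Version"                          # on the target machine: driver version and the max CUDA version it supports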

You could demonstrate a similar breakpoint with various combinations of drivers and CUDA versions.

People who want to maintain machines for use with newer built binaries, especially when we are talking about consumer cards, need to update their GPU drivers from time to time. If you want to support “older” drivers, especially with PTX, you cannot move your build machines forward arbitrarily.

None of this is new, or unique or specific to any CUDA version. The general principles have been in effect since earliest CUDA days, and the general phenomenon has been occurring since approximately the same time.

What Robert Crovella said.

For what it is worth, what the AI chatbot suggested echoes traditional recommendations regarding the correct way to build “universal” binaries: build a fat binary that contains SASS for all GPU architectures one wishes to support plus PTX for the latest GPU architecture supported by the compiler.

It has likewise always been understood that inclusion of PTX can improve future compatibility (and is thus a best practice) but does not provide an iron-clad guarantee. That is only achieved by installing the latest driver package.

One issue that may play into OP’s observations is that for many (15+, I think) years the CUDA installer would offer to also install the latest driver package if not already present, but this feature has been removed with CUDA 13 (by my reading of the release notes; I haven’t actually installed it myself yet), meaning users need to take separate action to upgrade the driver package.

Thank you both for your prompt reply.

I maintain all the GPU servers used by my group. To avoid kernel/driver version mismatches, which force a reboot of the servers, I had to lock the driver version (currently all locked at 535) using apt-mark hold; because of that, I was not able to test this until recently.
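
(For context, the hold/upgrade cycle looks roughly like this on our Ubuntu servers; the package names are whatever the distribution actually ships:)

sudo apt-mark hold nvidia-driver-535       # pin the driver so routine apt upgrades don't force a reboot
sudo apt-mark unhold nvidia-driver-535     # when ready to upgrade
sudo apt install nvidia-driver-580         # install the newer driver, then reboot the node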

I just upgraded the driver on one of the servers to nvidia-driver-580 and recompiled my code on that node (which has an A100 and an RTX 2080S) using CUDA 12.5. The compiled binary runs fine on this node for both GPUs; however, when running the same binary (statically linked with libcudart_static) on a different server, which has a 4090, I got the following error again:

MCX ERROR(-222):the provided PTX was compiled with an unsupported toolchain. in unit mcx_core.cu:3256

the host for the 4090 has driver 535.

If I go back to the first server and recompile the binary using CUDA 11.3 with static linking, the generated binary can be executed on both machines, regardless of the driver versions.

Can you help me understand this difference between the binaries generated by CUDA 11.3 and 12.5? Why can the CUDA 11.3-compiled binary run across different driver versions, while the one built with 12.5 fails?

Is 580 new enough for CUDA 12.5 to produce "universal" binaries again? Again, the only -arch flag I use in my build script above is `-arch=sm_50`.

Previously I ran experiments adding -arch/-code flags to embed SASS for specific GPUs in the binary, but in nearly all tested cases (across different GPU architectures) I did not see any speed improvement. So it appears that my kernel does not specifically benefit from new GPU capabilities beyond having more cores. Because of that, I decided to include only PTX, for portability.

Totally expected, and you could have run a similar experiment 3 years ago and observed the same outcome (of course, adjusting to an older set of drivers and toolchains.) Or you could do it right now, e.g. with CUDA 11.5 and CUDA 10.2 (just to pick some examples). Note that to do a faithful equivalent experiment, the target machine with CUDA 10.2 must have a CUDA 10.2-era driver installed. Then you can witness similar outcomes for a similar experiment. This is not a new development.

There are at least 2 mechanisms at play.

  1. The binary has embedded PTX (which is needed, i.e. actually used to attempt to create the SASS object that will actually execute). In this case, a PTX object in a fatbinary includes a PTX version. The version is not the same as the CUDA version that was used to create the PTX, but it increments along with CUDA version increments. When you have a PTX object that is being used to create/provide the final executable that will be used in a given scenario, the necessary sequence includes JIT-compilation by the GPU driver (in the target machine). The GPU driver “recognizes” up to a particular PTX version. If the PTX version in the PTX object is “newer” than the version that the driver recognizes, you will get an error message that pretty much says exactly that. This particular item can be “worked around” by not relying on PTX to provide the executable that will be used, and this involves specifying whatever arch is required for SASS object creation/inclusion in the fatbinary. Or you can update the GPU driver on the executable target machine, to one that is new enough to recognize whatever PTX version is indicated in the PTX object. Or you could revert the compilation machine to an older CUDA version, old enough that the indicated PTX version in the PTX object is old enough to be “recognized” by the GPU driver on the target machine.

  2. A suitable SASS object is available. Either because PTX JIT compilation was successful, or because an arch-appropriate SASS object was specified for inclusion in the fatbinary during compilation/creation. In this case, another check is required. The CUDA runtime version must be “consistent” with the CUDA driver version that is associated with the GPU driver in the target executable machine. This gets into a fairly involved topic of “CUDA compatibility” and the error message will be different (if a failure occurs here) than the one that will be emitted for a failure in item 1. I don’t want to write a long treatise here, but there is documentation as well as numerous questions about it that you can find. If we ignore the possible permissible exceptions, the basic notion is that the CUDA driver version of the executable/target machine must be greater than or equal to the CUDA runtime version that was used to compile the executable (SASS) object. The typical error message associated with a failure here is “CUDA driver version is insufficient for CUDA runtime version” again, a fairly exact description of the requirement, as well as one that has numerous public forum questions about it. Without devolving into discussion of CUDA compatibility provided by specific libraries and only applicable to certain GPU families, I would say the general solution to this kind of problem is also to update the GPU driver on the target machine. You could also switch the compilation machine to an older version, and you could also investigate CUDA compatibility via libraries - documented at the link I indicated above.
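
A quick command-line sketch for comparing the two versions involved in item 2 (both tools are already present in the scenarios discussed here): nvidia-smi reports the driver version and the maximum CUDA (driver API) version that driver supports, while nvcc --version on the build machine reports the toolkit/runtime version; setting aside the compatibility exceptions, the former needs to be greater than or equal to the latter.

nvidia-smi | grep "CUDA Version"     # on the execution machine: driver's max supported CUDA version
nvcc --version                       # on the build machine: CUDA toolkit / runtime version used to compile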

(someone may question: if PTX JIT compilation were successful, could you ever actually witness the second error scenario - “cuda driver version is insufficient…”. I don’t think so, but this is a hairy topic and I haven’t fully convinced myself that there is no possible gap there. )

I’m not going to directly answer that for a few reasons, one is that I don’t want to interpret a definition of “universal” - that is fraught with peril not only for the current audience but future readers. Second, if you want some attempt at universality, I believe you need to try to understand what I have written above. However I will say that the idea of 580 being “new enough” seems wrong to me. The general tension here is that I want the CUDA/driver versions on the compilation machine to be “older” than those used on the target/execution machines. So asking if 580 is “new enough” on the compilation machine to “produce” a universal binary that will be used elsewhere is backward to the concept that actually applies here.

Thank you again. I think I got it. Sorry that I did not read your previous reply more carefully.

This is what I learned:

  • each new CUDA version requires a minimum driver version, and that minimum increases over CUDA releases
  • the PTX built by each CUDA version is also tagged with a PTX version number, which likewise increases over CUDA releases
  • if no SASS was requested at compile time, a fat binary built by a given CUDA version can only run on drivers that support that PTX version
  • if SASS was created at compile time (with a CUDA version new enough to support the specified CC), then when running that SASS on a GPU of the same CC on a different machine, the driver on the target machine still needs to satisfy the requirement described above (its CUDA driver version must be at least the CUDA runtime version used to build), even though the SASS was built for that CC
  • If I want to create a single binary that supports as many client GPU architectures as possible, and as many client driver versions as possible (which is what I meant by "universal"; agreed that it is not accurate), then the only way to do it is to build with a CUDA version as old as possible (that can still build the code), so that the PTX carries the lowest possible version. This binary should be able to run on any driver whose version is greater than the minimum required by the old CUDA used to build.
  • the risk of using an old CUDA version, or a SASS-free fat binary, is that the compiler may not optimize the code or utilize new GPU resources as well, but that is code/GPU-feature dependent; in my case, I found that neither a newer CUDA nor embedding SASS improved speed
