Support multiple compute capabilities

How should I arrange to support multiple different compute capabilities when compiling my pipeline shader programs? Is there an automatic way to handle them as in CUDA?

The easiest way is to simply compile your PTX for the lowest supported streaming multiprocessor target, which is SM 5.0 in OptiX 6 and newer.

See these threads.

You could also generate multiple versions of your PTX code for the individual SM targets, check the SM version of the CUDA device, and load the matching PTX input code, but I wouldn’t expect any dramatic performance changes from that.
There actually have been cases where the OptiX PTX parser was not handling the newest SM versions.
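As a rough illustration of the "load the matching PTX" approach, here is a minimal host-side sketch. It assumes the device's compute capability major and minor numbers have already been queried (for example with `cudaDeviceGetAttribute` and `cudaDevAttrComputeCapabilityMajor` / `cudaDevAttrComputeCapabilityMinor`). The function names and the file naming scheme are hypothetical and not part of OptiX or CUDA.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Pick the highest prebuilt PTX target that does not exceed the device's SM
// version. SM versions are encoded as major * 10 + minor (e.g. 75 for SM 7.5).
// Returns 0 when none of the prebuilt targets is low enough for the device.
int selectSmTarget(int deviceMajor, int deviceMinor, const std::vector<int>& builtTargets)
{
    assert(deviceMajor >= 0 && deviceMinor >= 0 && deviceMinor < 10);
    const int deviceSm = deviceMajor * 10 + deviceMinor;
    int best = 0;
    for (int target : builtTargets) {
        if (target <= deviceSm) best = std::max(best, target);
    }
    return best;
}

// Hypothetical naming scheme: one PTX file per module and SM target.
std::string ptxFileName(const std::string& moduleName, int smTarget)
{
    return moduleName + "_sm" + std::to_string(smTarget) + ".ptx";
}
```

For an SM 7.5 device and PTX built for 50, 60, 75 and 86, `selectSmTarget(7, 5, {50, 60, 75, 86})` picks 75. Always including the SM 5.0 build gives every supported device a fallback.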

OptiX parses your input PTX and recompiles it anyway, emitting intermediate code for the GPU’s SM target version.
Then the PTX assembler and SASS code generator inside the CUDA driver optimize that again to the final microcode.

The CUDA toolkit and display drivers should have more impact. For example CUDA 8.0 generated a lot better code than CUDA 7.5. And since the PTX assembler and microcode generator also ship with the display driver, as well as the OptiX core implementation and ray tracing drivers, you’d benefit from all optimizations in newer drivers automatically.

OK. Thanks for the tip!

Following up on this question: I have essentially the same code which may be called as a CUDA kernel or through optixLaunch. The PTX generated for CUDA and OptiX looks the same. However, I get different results from exactly the same input. After putting in a lot of print statements, it appears that the PTX is being converted to different SASS. In one case (CUDA), the compiler is turning a SAXPY operation into an FMA, but in the other (OptiX), it remains separate multiplication and addition operations. In the PTX in both cases, the SAXPY is written out as separate multiplication and addition operations, so it appears that the SASS generators are doing different things.

Is it possible to control the OptiX SASS generator? Or look at the SASS code it generates?



Yes, OptiX and CUDA are compiled differently, the function call ABIs are incompatible, and one cannot in general expect them to be identical at the instruction level. OptiX does not have an API to export the SASS, but you can use a debugger or Nsight Compute to inspect your kernel SASS. The fact that your OptiX kernel is failing to promote your multiply+add into an FMA could be a bug or oversight; when the FMA is not explicit in your PTX, there are some cases where we have to infer that it needs to happen. Are you able to share kernel code that reproduces the issue with the missing FMA? Please send it to the optix-help mailing list if you can.


I’ll see if I can simplify my code. What is the optix-help mailing list?

You can use the name of the list at nvidia dot com to send any files you’d like to share with us privately (I’m trying to avoid typing it out because we get spam sometimes). It’s also fine to DM one of us privately, or post here on the forum if you don’t care or would prefer to share publicly.


Got it :-)

After I looked at a couple of cases of differences in my code, I saw that all of the differences were related to SAXPY operations when it’s subtraction rather than addition (SAXMY?):
d = a * x - y
The PTX always shows a separate multiplication followed by the subtraction. In CUDA this is turned into an FMA instruction in the SASS, but OptiX seems to miss the opportunity.
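To see why a fused and an unfused `a * x - y` can give different results from the same input, here is a small host-side C++ sketch. It is not the code from this thread; the function names are made up for illustration, and the `volatile` temporary is only there to stop the host compiler from fusing the two operations on its own.

```cpp
#include <cmath>

// d = a * x - y with the product rounded to float before the subtraction.
// The volatile store keeps the host compiler from contracting this into an FMA.
float separateMulSub(float a, float x, float y)
{
    volatile float product = a * x;
    return product - y;
}

// d = a * x - y as a fused multiply-add: a single rounding at the very end.
float fusedMulSub(float a, float x, float y)
{
    return std::fma(a, x, -y);
}
```

With a = x = 1 + 2^-12 and y = 1 + 2^-11, the exact product is 1 + 2^-11 + 2^-24. The separate version rounds it to 1 + 2^-11 and returns 0, while the fused version keeps the low bit and returns 2^-24. In CUDA device code, writing `fmaf(a, x, -y)` or the `__fmaf_rn` intrinsic should make the FMA explicit in the PTX, and `__fmul_rn` / `__fsub_rn` request operations that are not fused, so either behavior can be pinned down in the source rather than left to the SASS generator.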

I sent an example to the optix-help mailing list that I hope is helpful.