Simple PTX shader - OptiX 7

You can find an example of a Cmake macro to build PTX in the OptiX SDK’s Cmake files, specifically look at the file Cmake/Macros


My OptiX 7 application frameworks are not using SDK sutils functions and the CMake scripts compiling *.cu files to *.ptx are getting an explicit list of *.cu files plus dependencies and compile options.
That is easier to strip down to the bare minimum than the CMake methods used inside the OptiX SDK.

Also the naming scheme is gets translated to name.ptx and placed in a folder relative to the application.
No bloated *.ptx filenames or hardcoded paths like in the SDK.

Have a look at these places:

Generating custom build rules for *.cu to *.ptx translation:
using this *.cmake function:
Reading a *.ptx source into a string:
Building OptixModules from them:

More condensed code in the later examples, building all OptixProgramGroup elements with a single call:

Thank you @dhart and @droettger.

Quick question @droettger, if I may, when I run cmake with the NVCUDA_COMPILE_PTX I get the following error:

Unknown CMake command “NVCUDA_COMPILE_PTX”.

This is probably simple, but is there a script I am missing? If so, is there a way to put that script in the current directory (where the cmake is being executed)? I would like to keep the work as isolated as possible.

Thank you again for all the help.

Yes, that *.cmake file needs to be included in some CMakeLists.txt before you can use it.

In my examples that happens inside the top-most CMakeLists.txt which configures the overall CMake build environment
and that it finds that nvcuda_compile_ptx.cmake file inside my local 3rdparty/CMake folder at all, is due to setting the CMAKE_MODULE_PATH to that folder one instruction above that.

1 Like

Cool, it works - Thank you very much @droettger
I still have a lot to learn, not just about OptiX but apparently CMAKE

Hi @droettger
Thank you again for the assist with nvcuda_compile_ptx.cmake - I put it in my local CMake directory and called it to build a PTX file from CUDA code perfectly. However, when I tried to call optixTrace(...) in that CUDA code I get ptx compile error along the lines of the following:
ptaxs C:/Users/.../..._triangle_compute_35.ptx, Line 26; error: Label expected for argument 0

I have a triangle.h file with the following struct defined:

// OptiX 7
#include <optix.h>
#include <cuda_runtime.h>
// Vector math
#include “vec_math.h”
struct Params {
uchar4* image;
unsigned int image_width;
unsigned int image_height;
float3 cam_eye;
float3 cam_u, cam_v, cam_w;
OptixTraversableHandle handle;

The CUDA code that is being converted to PTX follows:

#include “triangle.h”

extern “C” {
__constant__ Params params;

static __forceinline__ __device__ void computeRay(uint3 idx, uint3 dim,
float3& origin, float3& direction) {
const float3 U = params.cam_u;
const float3 V = params.cam_v;
const float3 W = params.cam_w;
const float2 d = 2.0f * make_float2(
static_cast(idx.x) / static_cast(dim.x),
static_cast(idx.y) / static_cast(dim.y)) - 1.0f;

origin = params.cam_eye;
direction = normalize(d.x * U + d.y * V + W);

extern “C” __global__ void __raygen__rg() {
const uint3 idx = optixGetLaunchIndex();
const uint3 dim = optixGetLaunchDimensions();

float3 ray_origin, ray_direction;
computeRay(idx, dim, ray_origin, ray_direction);

unsigned int p0, p1, p2;

// Problem occurs when I add the following call:
optixTrace(params.handle, ray_origin, ray_direction,
0.0f, 1e16f, 0.0f, OptixVisibilityMask(255),
OPTIX_RAY_FLAG_NONE, 0, 1, 0, p0, p1, p2);

} // __raygen_rg(…)

The CMakeLists.txt file I am using follows:


# Set the PTX directory base

# Module path to cmake nvcuda_compile_ptx.cmake that is employed
# to build PTX code from CUDA

# Use NVCUDA_COMPILE_PTX function to produce the desired custom
# rule and output filenames when compiling OptiX from *.cu to
# *.ptx


# Compile CUDA source code to PTX shader(s)
TARGET_PATH “{PTX_TARGET_DIR}/haney_ptx" GENERATED_FILES PTX_SOURCES NVCC_OPTIONS "--gpu-architecture=compute_50" "--relocatable-device-code=true" "--Wno-deprecated-gpu-targets" "-I{OPTIX7_INCLUDE_DIR}”

#message("PTX_SOURCES = " “${PTX_SOURCES}”)

include_directories("." ${OPTIX7_INCLUDE_DIR})

add_executable(xOptixShader {SOURCES} {SHADERS} {SHADERS_HEADERS} {PTX_SOURCES})


I am hoping that I am doing something simple, but am kind of lost as to why I am getting the error when I use the optixTrace(...). Any help would be great.

Thanks again.

You’re saying you get an error from ptxas?
The PTX assembler is not involved in the compilation step from OptiX *.cu to *.ptx source files.
If this is happening during compilation of your project that means you setup your build process incorrectly.

If you compare your provided CMake code with the one in
you’ll notice that you’re missing a lot of $ signs before the CMake variables.
Not sure why this generated any solution at all.

I would change all of the bold marked variables and then some more things. See the list of issues below.






NVCC_OPTIONS “–gpu-architecture=compute_50” “–relocatable-device-code=true” “–Wno-deprecated-gpu-targets” "-I{OPTIX7_INCLUDE_DIR}” “-I${CMAKE_CURRENT_SOURCE_DIR}”)




1.) When using the FindOptiX7.cmake from my examples, those let CMake generate the “FOUND” variable and that matches the spelling of the FindOptiX7.cmake, otherwise newer CMake versions throw a warning. Means it should be OptiX7_FOUND when you didn’t declare that yourself.

2.) OPTIX7_INCLUDE_DIR inside the NVCUDA_COMPILE_PTX() call is missing the $ sign which means none of the OptiX device functions should have worked, which is effectively what you’re reporting.
That is true for all other bold variables above. I have no idea if that even generated anything valid inside the solution.

3.) You’re not using "--use_fast_math" in the NVCC_OPTIONS which will make your device code runsa lot slower, if at all . Please always prefer PTX compiled with --use_fast_math.

4.) In all OptiX device structs I would recommend to order the individual fields by their CUDA alignment restrictions to not have any compiler induced padding inside the structure.
Means both the uchar pointer and the OptixTraversableHandle are 64-bit values which are 8-byte aligned and I would place them at the beginning since all remaining float3 and unsigned int fields have an alignment of 4-bytes.
Compare to this:

Please always specify your development environment when asking about any OptiX issues:
OS version, installed GPU(s), display driver version, OptiX version (major.minor.micro), CUDA toolkit version, host compiler version.

Thank you @droettger for the reply.

The missing ‘$’ in my post must be an artifact of a copy-paste mistake because in my actual CMakeLists.txt file they exist. Thank you for the information on byte alignment in my struct(s) - very useful as well as the --use_fast_math NVCC_OPTIONS. The ‘OPTIX7_FOUND’ and associated ‘OPTIX7_INCLUDE_DIR’ are from a CMake FindOptix7 script that is custom for our code (modeled after your FindOptix7 cmake script) and it works for other OptiX7 code I wrote - granted very simple (just initializes Optix7 and exits).

I apologize for not listing the OS version, installed GPU(s), etc. - it was the end of a long day before the holiday. The OS version I am using is Windows 10 version 1809 (OS build 17763.1637), CUDA 11.1 (cuda_11.1.relgpu_drvr455TC455_06.29190527_0), Nvidia card: GeForce MX130, Nvidia Driver version: 456.81, CUDA compilation tools: release 11.1, V11.1.105, using OptiX 7.0.0

Still having same problem(s) when I try and include optixTrace(...) function call in, but everything compiles properly when I don’t include optixTrace(...) function call. Apparently this function call from OptiX 7.0.0 is not being found ? So still confused as to why this would be.

What’s your host compiler version?

If you compile the OptiX SDK 7.x examples or my OptiX 7 applications on your development environment, do they work?
If yes, there is some setup error in your project.
The recommendation would then be to start from a working example and change it to your needs.

Again, the compilation from OptiX device *.cu source files to *.ptx source files does not invoke the PTX assembler at all.
There cannot be any ptxas error in that step unless something went seriously wrong with your project setup.

See this related post and follow the links in there as well:

I described here how to get the exact NVCC command lines generated inside the OptiX SDK 7 CMake environment:

It’s also prepared inside the nvcuda_compile_ptx.cmake function:
Uncomment that line and the NVCC command lines get printed into the CMake output window.

Means you should be able to call these explicitly on a standard command prompt window (from inside the CUDA bin directory if tthe PATH environment varibles aren’t setup correctly) and if that is working but not inside your project, then there is again something broken in your project setup which cannot be seen in code excerpts.

The host is MSVS 2019 version 16.8.4

I can get your example code to compile on my local machine. So I will try and backtrack by building from your example(s).

Thanks again for the assist.

Hi @droettger, modifying your example code I was able to get it all to compile excepting for the following during compilation:
nvcc fatal : Option '--ptx (-ptx)' is not allowed when compiling for multiple GPU architectures

Is there an argument I could pass to cuda_compile_and_embed to remedy this?

Thanks again.

EDIT: I think I have fixed the issue. It was a corrupt FindOptix.cmake script I was calling - my bad.

Thanks again for all the help @droettger.
I created the triangle OptiX 7 code found on

However, when I try to write the result to an image file (ppm) from the Params struct I just get a black screen (all inputs are NULL in the ppm file). Can you tell me if I am doing something dumb copying the result back from the DEVICE to HOST? The following is the code snippet where I think the problem may be where d_param is on device (as per the example list on webpage):

// Now rendered results from the launch are in params.image
Params hparams = (Params)malloc(sizeof(Params));
cudaMemcpy(hparams, reinterpret_cast<void*>(d_param), sizeof(Params),
string fout(“myTriangle.ppm”);
fio.WritePPM(hparams->image, width, height, fout.c_str());
if(hparams) delete hparams;

ANY help or hints would be much appreciated.

You are aware that this blog describes how the optixTriangle example inside the OptiX SDK 7.x works?
Means there already exists the working example code inside the SDK you can compare against your version.

The code you posted is not copying an image! It’s copying the launch parameter structure itself from device to host.
That’s unnecessary because you have that structure filled in host memory somewhere earlier and the device copy is constant so nothing changes in it on the device.

Instead you would need to copy from the device buffer you stored into the CUdeviceptr Params::image and with width * height * sizeof(uchar4) bytes in that example.

Debug through the saveImage() code path inside the OptiX SDk 7.x optixTriangle.cpp code:

// Display results
    sutil::ImageBuffer buffer;         = output_buffer.getHostPointer();
    buffer.width        = width;
    buffer.height       = height;
    buffer.pixel_format = sutil::BufferImageFormat::UNSIGNED_BYTE4;
    if( outfile.empty() )
        sutil::displayBufferWindow( argv[0], buffer );
        sutil::saveImage( outfile.c_str(), buffer, false );

That example is using a small helper class to abstract the CUDA buffers for direct CUdeviceptr and OpenGL interop via Pixel Buffer Objects (PBO) and pinned host memory (zero-copy).

The optixTriangle is setting up a CUDA buffer without interop here:
sutil::CUDAOutputBuffer<uchar4> output_buffer( sutil::CUDAOutputBufferType::CUDA_DEVICE, width, height );

Debug through the getHostPointer() call to see that it’s using that buffer’s CUdeviceptr directly to copy the image from device to host. Note that the map() call in there returns the required CUdeviceptr of that buffer.

template <typename PIXEL_FORMAT>
PIXEL_FORMAT* CUDAOutputBuffer<PIXEL_FORMAT>::getHostPointer()
    if( m_type == CUDAOutputBufferType::CUDA_DEVICE ||
        m_type == CUDAOutputBufferType::CUDA_P2P ||
        m_type == CUDAOutputBufferType::GL_INTEROP  )
        m_host_pixels.resize( m_width*m_height );

        CUDA_CHECK( cudaMemcpy(
                    static_cast<void*>( ),
                    ) );

    else // m_type == CUDAOutputBufferType::ZERO_COPY
        return m_host_zcopy_pixels;

Thank you @droettger . Your answer was spot on and I follow it meaning I have all the paths to the include directories needed (e.g. SDK, include, sutils, etc). However when I add the following line:
sutil::CUDAOutputBuffer<uchar4> output_buffer(sutil::CUDAOutputBufferType::CUDA_DEVICE, width, height );

I get a “unresolved external symbol glad_glGetError” - I was successful in installing OptiX 7.0 on my Windows PC, so I’m not completely sure why the glad_gl is not found. Any pointers/hints would be greatly appreciated.

Thanks again

glad is an OpenGL helper code which loads OpenGL entry point functions, similar to the GLEW library.
It consists of only two files in the OptiX SDK 7.x.0\SDK\support\glad folder.
The glad.h is included in various SDK examples calling OpenGL functions and also in CUDAOutputBuffer.h.

You cannot include arbitrary headers into your code base without also having the necessary dependencies resolved.
Means you either dissect that sutil source code and grab only the minimum necessary pieces which do exactly what you need, or you use the whole sutil library, but that is really only meant to simplify the OptiX SDK examples and not production code.

In your case you would only need to use the correct CUdeviceptr for the image copy from device to host which is a single code line.
I’m not using any of that sutil code in my examples, and copying a float4 rendered image from device to host looks like this:
CUDA Runtime API:
CUDA Driver API:

That is really all what is remaining when you ignore the OpenGL interop methods.

1 Like

I really appreciate you helping me out @droettger . You have been invaluable to my experience(s) with OptiX.

I will try to follow your code, since at this point I just want a no-frills image rendered directly to a host-side file (maybe ppm).

Thanks again.

Quick question, if that is okay. Can anyone illuminate the use of Shader Binding Table verses passing a parameter(s) to a shader via a struct?

Just looking for an experienced developer’s opinion on SBT in real-world operations.

Thanks again.

Inside the SBT you can have a programmable amount of data behind the 32 bytes header which identifies the shader.
Means you can have different data per SBT record entry which can be directly read in a closest hit program for example.
(Hmm, there is no limit on the data size documented in OptiX 7. In Vulkan the data size is limited by the maximum stride between record entries.)

The SBT has a lot of flexibility, for example, you can have an entry per instance of a geometry, or you could only have all shaders exactly once and determine a different behavior per instance with the user defined instance ID, while the user defined instance SBT offset picks the shader.
I normally use the former method but know of applications doing the latter.
For that case the instance index would be used to fetch data from data you prepared in some buffer.

Then there are GAS which have multiple SBT entries, which allows to let sets of primitives behave differently depending on their SBT contents.
Not using the SBT for this would require a different index per triangle to pick the required data. That’s more expensive.

“Passing data to a shader via a struct” would require that you already know which shader is called beforehand, but you shoot your primary ray inside the ray generation not knowing which hit record will be reached (or missed), so how would you pick the contents of the struct inside the ray generation program? This would then need to be part of the per ray payload as index, pointer, or copy somehow. Waste of time if you missed.

There is exactly only one structure passed as launch parameters which resides in constant memory (limited in size).
But you can have CUdeviceptr in that launch parameter struct, which point to buffers in which you can pass as much data as you want, limited by your available VRAM only.

Here’s a nice comparison of the the SBT mechanisms in the different raytracing APIs (DXR, Vulkan KHR_ray_tracing(_pipeline), OptiX 7) with interactive part where you can construct an SBT:

Thank you @droettger for the information and link.