How does optix code compilation work?

Hi,

I’m trying to build an application that includes OptiX and other CUDA libraries like cuFFT and cuBLAS, so I need to understand how the compilation works.

My impression is that OptiX program code has to be compiled into PTX code and further compiled at run time. Other non-OptiX .cpp and .cu code can be compiled normally into cubins.

What I don’t get is how the final just-in-time compilation and loading of the PTX code works. Do I compile and link the non-OptiX code into an executable, let it know where the OptiX PTX code is, run it, and it will load and further compile the PTX code?

Thanks.

Huy.

Hi Huy,

Do I compile and link the non-OptiX code into an executable, let it know where the OptiX PTX code is, run it, and it will load and further compile the PTX code?

That’s exactly right. You compile your OptiX programs into PTX (or OptiX-IR), typically using nvcc at build time, and then at run time your program calls optixModuleCreateFromPTX() or optixModuleCreateFromPTXWithTasks() one or more times, which compiles your intermediate code into binary form. You end up with one or more OptiX modules and link them together with optixPipelineCreate().
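As a minimal sketch of the run-time side: the PTX file that nvcc wrote at build time just needs to be read into a string and handed to the module-creation call. The helper name `readPtx` and the file name below are hypothetical, and the OptiX call itself is shown only as a comment since it needs the SDK:

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a PTX file produced by nvcc at build time
// into a string, ready to hand to optixModuleCreateFromPTX().
std::string readPtx(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("Cannot open PTX file: " + path);
    std::ostringstream ss;
    ss << in.rdbuf();  // slurp the whole file into the string stream
    return ss.str();
}

// At run time (sketch, requires the OptiX SDK to actually compile):
//   std::string ptx = readPtx("myPrograms.ptx");
//   optixModuleCreateFromPTX(context, &moduleCompileOptions,
//                            &pipelineCompileOptions,
//                            ptx.c_str(), ptx.size(),
//                            log, &logSize, &module);
```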

This entire process is specific to OptiX and doesn’t really overlap with other CUDA programs or libraries. The build & link process for anything else non-OptiX will be the same as it was before; there’s essentially no overlap as far as the build is concerned.

If you can keep your CUDA and OptiX code cleanly separated, then you shouldn’t really have anything tricky to deal with. If you want to share headers and/or code between the OptiX and CUDA sides, then you can end up with some amount of work to keep both compilers happy. It would be recommended and easiest, if you can, to keep the OptiX and CUDA code separated and not try to share symbols.

Note I’m only talking about mixing the source code together. This has no bearing on whether you can mix OptiX launches and CUDA kernels at run time, nor whether you can pass buffer pointers back and forth; those things will work just fine in any case.


David.

I’m building the SDK examples, and when CUDA_NVRTC_ENABLED is ON in CMakeLists.txt, I don’t see any PTX files created. Only when it’s OFF do I see them in build/lib/ptx. Why is that? Where is the PTX code when it’s ON?

With NVRTC, your source code is compiled into PTX on the fly, “just in time”; NVRTC replaces nvcc as the C++ → PTX compiler in your build. In the JIT case the PTX exists only in memory, and you pass it directly to optixModuleCreateFromPTX() as a string without having to read or write a PTX file.

You can study the run-time build workflow if you look at the source in sutil.cpp.

The reason to use nvrtc is when you want to be able to change your OptiX shaders and re-run your application without having to run any build process. It’s usually just a minor convenience, and would typically be used for internal/private projects. The reason that most people use nvcc for professional applications is to avoid having to ship their OptiX shader source code.

The rest of the compilation process after you build the PTX is the same in either case.


David.

thank you for the response. i’ll try and probably have more questions later.

Please read this related thread and follow the links in there for more information on NVCC and NVRTC compilation for OptiX programs:
https://forums.developer.nvidia.com/t/nvrtc-missing-stdint-h/146318

The reason that most people use nvcc for professional applications is to avoid having to ship their OptiX shader source code.

When using NVRTC, you would also need the OptiX headers and the CUDA headers, which means both SDKs would need to be installed on the target machine.
NVRTC can only generate PTX device code. Last time I checked it was about three times faster than NVCC because it doesn’t write any files.
I’ve used it in the past to generate high-level CUDA source code for materials at runtime which then got translated to PTX for OptiX on the fly.


To make sure that I got it correctly: NVRTC is not a program that can be called from the terminal like nvcc (since I don’t see it in /usr/local/cuda/bin where nvcc is), but a library to be linked, right?

So when compiling OptiX device code, I still invoke nvcc but link it with -lnvrtc?

NVRTC is not a program that can be called from the terminal like nvcc but a library to be linked, right?

Yes, NVRTC is a library with a few entry points with which you compile CUDA device source code to PTX device source code inside your application at runtime.
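The typical entry-point sequence looks roughly like this. A sketch only, assuming the CUDA toolkit’s nvrtc.h is available; the source string, file name, and architecture option are examples, and error checking is elided:

```cpp
#include <nvrtc.h>
#include <string>

// cudaSource holds the CUDA C++ device code as a string.
nvrtcProgram prog = nullptr;
nvrtcCreateProgram(&prog, cudaSource, "myPrograms.cu",
                   0, nullptr, nullptr);          // optional headers go here
const char* options[] = { "--gpu-architecture=compute_60" };
nvrtcCompileProgram(prog, 1, options);            // CUDA C++ -> PTX
size_t ptxSize = 0;
nvrtcGetPTXSize(prog, &ptxSize);
std::string ptx(ptxSize, '\0');
nvrtcGetPTX(prog, &ptx[0]);                       // PTX stays in memory
nvrtcDestroyProgram(&prog);
// ptx can now be passed straight to optixModuleCreateFromPTX().
```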

(I don’t know how some of the CUDA toolkit files are named under Linux, so the following is how it looks under Windows:)

On the host side you need to include the nvrtc.h header (inside your CUDA/<version>/include folder) and link against the CUDA export library named nvrtc.lib (inside the CUDA/<version>/x64/lib folder) to be able to compile an application making NVRTC API calls.

Which calls are needed can be found inside the OptiX SDK example framework when searching for nvrtc.

That export library only contains the interface of the dynamic link libraries which implement the actual NVRTC compiler and a precompiled standard library it needs.
These are located inside the CUDA/<version>/bin folder and are named with an nvrtc-prefix and the CUDA version, e.g. for Windows CUDA 10.1 they are named nvrtc64_101_0.dll and nvrtc-builtins64_101.dll.
These need to be redistributed along with the application.

As explained in the links I posted above, all headers which you’d need to compile the CUDA code would also be required on the target machine (and since license terms forbid shipping these with your application, the end user would need to install CUDA and OptiX SDKs on his/her own.)

Since NVRTC can only compile device code, care needs to be taken to never include any host compiler includes inadvertently (also described inside the linked threads), because you cannot expect a target system to have any compiler installed, at least under Windows.

So when compiling an optiX device code, I still invoke nvcc but link it with -lnvrtc?

Not sure I understand the question. If all your CUDA code is translated to PTX with NVRTC you wouldn’t need NVCC and vice versa. You can also compile all CUDA device code which never changes with NVCC during build time of your project and only translate dynamically generated CUDA sources with NVRTC to PTX at runtime.

If you do not have any need to generate CUDA device code at runtime, there is also no need to use NVRTC at all.
You should simply build everything with NVCC and ship the translated PTX code with your application.

That is what most applications do and what all OptiX SDK examples do when you disable NVRTC inside the CMake settings.

The -lnvrtc is a linker flag that I think I need to include when compiling the host code (I’m running on Linux).

Isn’t -lnvrtc just telling the host linker (gcc, not nvcc) to link against the NVRTC library to be able to resolve the NVRTC API entry points?
Again you wouldn’t need to link against the NVRTC export library at all when not calling any nvrtc entry points inside your application’s host code.

Ok, I think I got OptiX to compile and work with a CUDA application. My question now is: can the ray parameter struct, Params, in the SDK be templated? If so, how would that change the compilation process?

You want the structure which is used as launch parameter block in constant CUDA memory to be a template?
Why? Based on what template arguments?
How many different launch parameter structures would you need and how would that change the device programs accessing that data?

Usually you would implement a specific launch parameter structure which exactly matches to how the device programs inside one or multiple pipelines are coded. That is something you hardcode once and never touch again.

Often it’s not actually necessary to have different structs at all.
For example, if you need some pointers inside the types inside the launch parameter structure to point to memory of different formats, you can define the pointer as CUdeviceptr and reinterpret that to the desired type dynamically. (Mind that CUDA requires specific byte alignments for different vector types. You’ll get misaligned access errors when not adhering to the proper alignment.)
Example code here where I switch between float4 and half4 buffers with a compile time switch, but that could also be handled with a runtime parameter:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_denoiser/shaders/system_parameter.h#L43
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_denoiser/shaders/raygeneration.cu#L233
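The pattern can be illustrated host-side with a stand-in for CUdeviceptr (in the CUDA driver API it is an integer wide enough to hold a device address). The struct names, the `useHalf` flag, and the element types below are all made up for the illustration; on the GPU the half type and the half-to-float conversion would be the real CUDA ones:

```cpp
#include <cstdint>

// Stand-in for CUdeviceptr from the CUDA driver API.
using DevicePtr = unsigned long long;

struct Float4 { float    x, y, z, w; }; // 16-byte buffer element
struct Half4  { uint16_t x, y, z, w; }; // 8-byte stand-in for half4

// Hypothetical launch-parameter struct: one generic pointer plus a
// runtime flag saying which element format the buffer actually uses.
struct Params
{
    DevicePtr outputBuffer; // reinterpreted at access time
    int       useHalf;      // runtime switch between formats
};

// Device-code-style access: reinterpret the generic pointer to the
// concrete element type on demand.
float firstX(const Params& p)
{
    if (p.useHalf)
    {
        const Half4* buf = reinterpret_cast<const Half4*>(p.outputBuffer);
        return static_cast<float>(buf[0].x); // would be half -> float on the GPU
    }
    const Float4* buf = reinterpret_cast<const Float4*>(p.outputBuffer);
    return buf[0].x;
}
```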

You also don’t need different launch parameter structs per pipeline just because the pipelines access different fields; each can read only what it needs from one bigger structure.

Note that the constant CUDA memory the launch parameters reside in is limited to 64 kB. Means whatever big data you need to access globally, only store a pointer to that inside the launch parameters.
Make the launch parameter structure as small as possible. Place fields in them according to their alignment requirements (in decreasing alignment) to prevent the compiler from adding unnecessary padding.
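A small illustration of why the field order matters; the struct names are made up and the sizes are the typical ones on 64-bit platforms:

```cpp
// Fields ordered badly: the 8-byte-aligned double forces the compiler
// to insert padding around the 1-byte fields.
struct BadParams
{
    char   flag;  // 1 byte + 7 bytes padding
    double scale; // 8 bytes
    char   mode;  // 1 byte + 7 bytes tail padding
};                // typically 24 bytes

// Same fields ordered by decreasing alignment: no internal padding.
struct GoodParams
{
    double scale; // 8 bytes
    char   flag;  // 1 byte
    char   mode;  // 1 byte + 6 bytes tail padding
};                // typically 16 bytes
```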

Other than that, it’s a struct defined in a header and used inside C++ code. I don’t see why you shouldn’t be able to implement that as a template but I have never attempted this because that’s usually unnecessary.

Maybe you’re interested in launch parameter specialization? This is different than templating, but it allows you to conditionally compile some of your launch parameters as if they are constant values, thus allowing the compiler to elide them, and speed up your device code.


David.

ok i think i got the idea.

speaking of types, can optix do double precision, for acceleration structs, and launch parameters, like ray origins, directions et al…? i only see float

also, i see the launch parameter is defined as __constant__ Params params in device code, while on the host it’s d_params and copied from the cpu by a regular cudaMemcpy. besides the different names, i thought constant memory is copied with cudaMemcpyToSymbol. is something else happening under the hood?

speaking of types, can optix do double precision, for acceleration structs, and launch parameters, like ray origins, directions et al…? i only see float

No. None of the available ray tracing APIs (OptiX, DXR, Vulkan Raytracing) support double precision data in acceleration structures, the ray definition, the transforms, or any other built-in functionality. That is all 32-bit floating point precision.

What you put into the launch parameters or any other developer defined data structure or how you implement your device code is completely your choice. Means it’s possible to use doubles in your OptiX device code but unless there are quantifiable precision requirements it’s definitely not recommended to do so simply for performance reasons.

Note that double precision performance on standard desktop and mobile GPUs is dramatically slower than single precision performance, except on some compute-only products which in turn have no hardware RT cores. (Your V100 is one of them.)
You can query the single to double precision performance ratio via the CUDA runtime API cudaGetDeviceProperties() or CUDA driver API cuDeviceGetAttribute() calls.
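With the CUDA runtime API, the query is a one-liner; a sketch assuming a CUDA-capable machine with device 0:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    // How many single precision operations run in the time of one
    // double precision operation on this GPU:
    printf("FP32:FP64 performance ratio = %d:1\n",
           prop.singleToDoublePrecisionPerfRatio);
    return 0;
}
```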

The forum has a search feature in the top right which can be limited to sub-forums when starting the search, for example, on the OptiX forum view. Please have a look into these previous discussions about that topic which explain some options. Look out for comments on watertight intersections in the results as well.
https://forums.developer.nvidia.com/search?q=double%20precision%20%23visualization%3Aoptix

also, i see the launch parameter is defined as __constant__ Params params in device code, while on the host it’s d_params and copied from the cpu by a regular cudaMemcpy. besides the different names, i thought constant memory is copied with cudaMemcpyToSymbol. is something else happening under the hood?

Yes, OptiX handles that for you. That’s why you need to provide the launch parameter variable name in OptixPipelineCompileOptions::pipelineLaunchParamsVariableName.
https://raytracing-docs.nvidia.com/optix7/api/struct_optix_pipeline_compile_options.html#a716d5238c52743e20dce1e92575c6802
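For example (a sketch; the variable naming follows the SDK samples, and the surrounding context/pipeline setup is omitted):

```cpp
OptixPipelineCompileOptions pipelineCompileOptions = {};
// Must match the name of the __constant__ variable in the device code:
//   extern "C" __constant__ Params params;
pipelineCompileOptions.pipelineLaunchParamsVariableName = "params";
// On the host, the filled Params struct is copied into a device buffer
// (d_params) with a regular cudaMemcpy, and that buffer's address is
// passed to optixLaunch(), which binds it to the named variable.
```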

the reason i’m asking for double precision is because i only want to do 2d graphics, polygonal objects, but from what i understand, in optix the way to represent polygons is piecewise-linear curves with thickness.

so naturally, i set everything up on the xy plane, e.g. the acceleration structures, ray origins and directions et al. all have zeros for z coordinates, and with very small thickness (2d polygons’ edges have zero thickness). but i notice that if i set the thickness smaller than 3e-4, ray-object intersections/hits are wrong, as if the thickness is so small that some rays miss objects they should hit.

The curve primitives in OptiX are not 2D. They are round 3D shapes, like cylinders or the volume built by sweeping a sphere with varying radius along a 3D curve. Their main use case is the implementation of hair strands in 3D renderers.

A lot of care has been taken to make the curve intersection algorithms as precise as possible, but depending on your scene and camera setup there could of course be precision issues from the finite floating point representation.
But ray tracing does not work like rasterization, where line primitives affect whole pixels depending on specific rasterization rules (like diamond exit) that either set or don’t set a pixel on the screen.

Note that curve primitives can be a lot thinner than one pixel depending on the camera setup. Means if the sampling of the fragments making up one pixel on the screen is not dense enough, you will simply not be able to hit the curve with all rays because the curve can fall between the discrete fragment sample points. That is unrelated to floating point vs. double precision. (Nyquist theorem comes to mind.)
That is why curves are usually rendered by partitioning each pixel into very many fragments which each define a primary ray to accumulate the hit and miss results from geometric primitives accurately. The OptiX SDK curve examples show that.

i only want to do 2d graphics, polygonal objects,
i set everything up on the xy plane, e.g. the acceleration structures, ray origins and directions et al, all have zeros for z coordinates

Are you saying you want to shoot rays in the same plane as the geometry?
Otherwise, if you plan to project the polygon outlines onto some camera plane, the z-components of the ray origin and ray direction shouldn’t both be zero.

What exactly do you want to implement?

The following assumes this is about 2D graphics in the usual sense.
Why do you think that would require ray tracing?

Even if you did do that with ray tracing, you wouldn’t need curve primitives; you could also define your 2D polygonal objects by tessellating them into flat triangles instead, which would be even more efficient on RTX boards due to the watertight triangle intersection hardware.
Still, the sampling of very thin triangles would run into the exact same camera sampling issues.

I think a rasterizer could handle that a lot faster and would also allow overlapping geometry by render order without the need to handle depth separation (painter’s algorithm). Precision could be increased by using multisampling.

Maybe have a look at the NVIDIA Path Rendering SDK instead which uses a dedicated OpenGL extension GL_NV_path_rendering to implement hardware accelerated resolution independent vector graphics.
https://developer.nvidia.com/nv-path-rendering
https://developer.nvidia.com/gpu-accelerated-path-rendering