OptiX 7 and MSVS 2017 - Hello World on Windows 10

Good Morning,

This may be a rudimentary question, but I am trying to build my first OptiX program using Microsoft Visual Studio 2017 on a Windows 10 machine - a ‘hello world’ type of program, nothing fancy at this point. Has anyone had success developing OptiX via MSVS 2017 on Windows 10? If so, could you tell me how to set it up from scratch?

The OptiX version I am using is 7, with CUDA 11 on a GeForce RTX 2080 Super.

Any help would be greatly appreciated.

Thanks,

If you’re fine with using CMake to generate a solution with optionally multiple projects, independently of the host compiler version and cross-platform, then there are quite a few examples that do exactly that.
These sticky posts contain links to multiple examples which do that differently than the OptiX SDK examples:
https://forums.developer.nvidia.com/t/optix-7-1-release/139962
https://forums.developer.nvidia.com/t/optix-advanced-samples-on-github/48410/4

My OptiX application framework is completely standalone. It’s not using anything from the OptiX SDK 7 except for the host and device API headers.
You could easily strip down one of the intro_* examples to just the ray generation program which colors the output buffer without shooting a single ray. That’s effectively the “optixHello” program in all SDKs.
The OptiX 7 SIGGRAPH course examples have that, and my OptiX 5/6 introduction samples have it as well.
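As a sketch of what such a stripped-down ray generation program looks like - the Params struct and its field names below are my own assumptions for illustration, not taken from any particular SDK sample:

```cpp
// Minimal "hello world" ray generation program: fills the output buffer
// with a color gradient without shooting a single ray.
#include <optix.h>

struct Params              // assumed launch parameter layout
{
    float4*      image;    // output buffer, width * height elements
    unsigned int width;
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__hello()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    // Simple UV gradient as the "hello world" output.
    params.image[idx.y * params.width + idx.x] =
        make_float4(idx.x / (float)dim.x, idx.y / (float)dim.y, 0.0f, 1.0f);
}
```

The host side only needs to build a pipeline with this one program, allocate the image buffer, and call optixLaunch() with the Params struct.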

If you mean you want to do that as a MSVS native project using the CUDA toolkit Visual Studio integration, things get a little more manual.
Assuming you know how to handle all other parts of a MSVS solution for the host code, the remaining problem is setting the correct CUDA compiler (NVCC) options on the *.cu files in your project to compile them to *.ptx input files for OptiX 7 that are both correct and fast. The rules are:

  • Use the sm_50 and compute_50 architecture targets (e.g. --gpu-architecture=compute_50) to generate PTX code for Maxwell, the minimum supported GPU architecture. That will work on any newer GPU as well. Note that this target is deprecated in CUDA 11 and will result in warnings; CUDA 10.x does not warn. You can target SM 6.0 (Pascal) instead to suppress these warnings. (The OptiX SDK 7.1.0 examples do that.)
  • Use --machine=64 (-m64), only 64-bit code is supported in OptiX.
  • Do not compile to obj or cubin, the output needs to be --ptx.
  • Do not use the debug flags -g and -G, which are the default in the Debug target. OptiX might not handle all debug instrumentation.
  • Enable --use_fast_math to get faster code. That means .approx instructions for trigonometric functions and reciprocals and no inadvertent use of slow double precision floats.
  • Enable --relocatable-device-code=true (or -rdc) or --keep-device-functions. This is required to keep the CUDA compiler from eliminating direct or continuation callables as dead code because it doesn’t find a call to them in the same module.
  • Enable --generate-line-info if you want to profile your code with Nsight Compute in the future.
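Putting those rules together, the NVCC command line for one *.cu file would look roughly like this - the OptiX include path is the default Windows install location and the file names are placeholders, so adjust both to your setup:

```shell
# Sketch of an NVCC invocation following the rules above (paths are assumptions).
nvcc --machine=64 --gpu-architecture=compute_50 --use_fast_math \
     --relocatable-device-code=true --generate-line-info --ptx \
     -I"C:/ProgramData/NVIDIA Corporation/OptiX SDK 7.1.0/include" \
     raygeneration.cu -o raygeneration.ptx
```

In a native MSVS project you would enter the equivalent options in the CUDA C/C++ property pages of each *.cu file.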

You can check the exact NVCC command line the OptiX SDK examples use for each *.cu file by adding a message output while generating the project in CMake. Follow the link in this post:
https://forums.developer.nvidia.com/t/cmake-dont-compile-cuda-kernels/140937/6

@droettger Thank you for the information. Very helpful to an OptiX newbie :)

I am trying to explore all possible options at this point. Do you know if there has been any success using CUDA to implement ray tracing? For example, can I employ CUDA to generate a series of frames - not for real-time rendering, just a buffer (or buffers)? Could one generate these buffers to get a measurement of FPS?

Thank you again for the help.

Do you know if there has been any success using CUDA to generate ray tracing?
For example, can I employ CUDA to generate a series of frames - not for real time rendering, just a buffer(s)?
Could one generate these buffer(s) to get a measurement of FPS?

I’m trying to interpret that.

First, there have been many ray tracing implementations using CUDA natively in the past.
OptiX itself uses CUDA internally, and with OptiX 7 all the host interaction is native CUDA code as well, which simplifies interoperability between CUDA and OptiX 7 a lot.

Now, if you mean using CUDA to generate the rays which are then used in OptiX, yes, of course.
You can implement your ray generation program as you like. If you have a number of rays in a buffer, it’s very simple to use them as input.
There are renderer architectures where only the ray intersection part has been replaced with OptiX and everything else (shading, ray generation) runs in CUDA.
There is even an OptiX SDK example demonstrating that mechanism named optixRaycasting in which native CUDA kernels generate rays and OptiX shoots them and returns hit/miss results, which are then shaded in a native CUDA kernel again.
You could also do that with multiple buffers you prepared up front.
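A sketch of that pattern from the OptiX side - a ray generation program that only consumes rays prepared by a native CUDA kernel and writes hit results back for later shading. The Ray and Hit struct layouts and the Params fields are illustrative assumptions, not the SDK's exact definitions:

```cpp
#include <optix.h>

struct Ray { float3 origin; float tmin; float3 dir; float tmax; };
struct Hit { float t; unsigned int primId; };

struct Params              // assumed launch parameter layout
{
    OptixTraversableHandle handle;
    const Ray*             rays;  // filled by a native CUDA kernel
    Hit*                   hits;  // consumed by a native CUDA kernel
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__from_buffer()
{
    const unsigned int i = optixGetLaunchIndex().x;
    const Ray ray = params.rays[i];

    unsigned int t_bits = __float_as_uint(-1.0f); // payload 0: hit distance, -1 = miss
    unsigned int prim   = ~0u;                    // payload 1: primitive index

    optixTrace(params.handle, ray.origin, ray.dir, ray.tmin, ray.tmax,
               0.0f,                        // ray time (no motion blur)
               OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
               0, 1, 0,                     // SBT offset, stride, miss index
               t_bits, prim);               // closest-hit/miss programs fill these

    params.hits[i].t      = __uint_as_float(t_bits);
    params.hits[i].primId = prim;
}
```

The corresponding closest-hit and miss programs would just store the distance and primitive index into the two payload registers.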

But the issue with that is that the number of rays the RTX hardware can handle (>10 GRays/sec) is way higher than what you can read from and write to VRAM at the same time. This means you’re normally limited by memory bandwidth, and you need to keep memory accesses as low as possible to gain the most from the hardware’s capabilities.

There used to be a whole ray intersection API called OptiX Prime, which was discontinued in OptiX 7.0.0 for that very reason; and since the OptiX 7 API is much more flexible and its host code uses native CUDA buffers, it was no longer necessary.

Still, it’s faster to generate rays on the fly with arithmetic than to write them into buffers read by OptiX again. The goal for optimal performance with OptiX and CUDA in general is to make use of registers as best as possible.

Please have a look through the OptiX SDK examples first.
There are examples like the optixMeshViewer which loads glTF models and handles their material as well. That is a Whitted style ray tracer which is really fast.

Then all my OptiX 7 examples implement a simple global illumination unidirectional path tracer with a very flexible architecture. All of these examples contain a benchmark functionality as well.
The most advanced one (rtigo3) can render in arbitrary resolutions independently of the window client area with a fixed number of samples. It can load triangle mesh data from different model file formats and assign materials to them.
This one is explicitly meant to compare the performance between different multi-GPU workload distribution and compositing strategies, as well as different OpenGL interoperability methods for the final display of the result.
It also works with single GPU of course and contains a pure ray-tracing-only benchmark mode without the display part where the final image is written to disk only.

I would look at these first to get an impression of the RTX ray tracing performance.

Before doing that, please read this post about how to measure FPS without being limited by the monitor refresh rate:
https://forums.developer.nvidia.com/t/optix-6-5-demo-performance-concern/128404/2

@droettger,
I apologize that I wasn’t clearer with my question. I am a newbie to ray tracing and OptiX but not so much to GPGPU computing, so maybe I don’t even know enough to ask the proper question :)

My interest in ray tracing is not real-time rendering but rather performance measurement of the operation itself, for the purpose of creating a buffer that could be used later - maybe measured in frames per second (FPS).

I have some old Vulkan code that does ray tracing (I didn’t write it) with no actual rendering, and I would like to convert it to either OptiX or pure CUDA. My preference is CUDA given my background in GPGPU, but if OptiX is better suited, that is okay.

Thank you again for your help.

I have some old Vulkan code that does ray tracing (I didn’t write it) with no actual rendering, and I would like to convert it to either OptiX or pure CUDA. My preference is CUDA given my background in GPGPU, but if OptiX is better suited, that is okay.

You are aware that NVIDIA drivers contain a Vulkan Ray Tracing extension today?
Here are some native Vulkan Ray Tracing examples. One of them is using the OptiX AI Denoiser as a post process.
https://github.com/nvpro-samples

You can only access the ray tracing hardware support in RTX boards via the ray tracing APIs DXR, Vulkan Ray Tracing, and OptiX, where only OptiX has support for motion blur, multi-level hierarchies, and built-in curve primitives (new in 7.1.0).

If you’re fluent in CUDA, then OptiX 7 is the best fit, since the host code manages everything with the CUDA Runtime or Driver API (your choice, I prefer the driver API) and the device code is effectively CUDA C++ with additional OptiX device functions. It’s a header-only API.

Note that not all CUDA constructs are allowed in OptiX device code (shared memory, syncthreads, etc.) because OptiX is using a single ray programming model and controls all hardware scheduling internally.
https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#program-input

@droettger,
Thanks for the information. No, I wasn’t aware that NVIDIA drivers contain a Vulkan Ray Tracing extension - I do now though :)

I built a simple ray tracing program using CUDA - it creates a PPM file from a matrix (frame buffer), and the result looks pretty good. However, you may be correct: if I really want to get into ray tracing, OptiX 7 may be the way to go.

Thank you again for all your help.

Thanks again to @droettger for all your assistance - great for an OptiX newbie like myself.

I found a link with some great intro code for OptiX newbies that may be of some help: [https://github.com/ingowald/optix7course]. Most have likely found this already, but just in case.

Last question, if that is okay: can anyone tell me if there is a programmatic way to determine and/or set the total number of rays being cast? I would like to determine the cost of rendering as the number of rays increases, as a function that outputs to the standard console rather than to the display buffer/screen - if that is possible.

Thanks.

Right, these examples and more are listed in both sticky posts I linked to in my first answer.

For performance reasons, I would be wary of the gdt/math/vec.h header used in the SIGGRAPH course examples, because I do not expect it to result in vectorized loads and stores for 2- and 4-component vectors the way the CUDA built-in vector types do.
Instead, I would recommend using the CUDA built-in vector types, or at least deriving your own types from them, to benefit from the faster vectorized load and store instructions.
Really, fast memory accesses are important.
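For example, because the CUDA built-in float4 is 16-byte aligned, the compiler can emit a single 128-bit load/store instruction for it, whereas a hand-rolled struct of four floats typically gets four scalar accesses. A small sketch (kernel and buffer names are made up for illustration):

```cpp
// float4 is declared with 16-byte alignment, so each access below compiles
// to one ld.global.v4.f32 / st.global.v4.f32 instruction instead of four
// scalar 32-bit memory operations.
__global__ void copy_vectorized(const float4* __restrict__ src,
                                float4*       __restrict__ dst,
                                int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i]; // one 128-bit load and one 128-bit store
}
```

You can verify this by looking at the generated PTX or SASS in Nsight Compute.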

Last question, if that is okay, can anyone tell me if there is a programmatic way to determine and/or set the total number of rays being cast?

There is no automatic way to count or set that. You would need to implement that yourself.

“Setting” the number of rays cast depends entirely on your implementation. You completely control whether an optixTrace() call is made or not.
It’s usual to globally limit the ray depth, i.e. the path length, in ray tracers. You must know the maximum number of recursive optixTrace() calls up front anyway, because otherwise you cannot calculate the OptiX pipeline’s stack size.
I do not recommend using recursive ray tracers, especially not when they require a lot of stack space, because the maximum stack size has a hard limit of 64 kB today.

Counting the number of optixTrace() calls is pretty simple though.
You would just need to increment a counter before each optixTrace() call in your code.

There are different ways to manage that counter.

I would hold that in the per ray payload and write it out to a buffer at the end of the ray generation program. That would also allow visualizing the number of rays as a heat map to see which parts of the scene used most rays.
Compare that with this method: https://forums.developer.nvidia.com/t/timing-rttrace-via-nvapi/129726

Then you can sum up the total number of rays counted in that buffer, after each launch or after the final frame, either on the host or with a separate CUDA kernel, and print the result to the console.
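A sketch of that counting mechanism - all names here (the Params fields, the rayCountBuffer) are assumptions for illustration:

```cpp
#include <optix.h>

struct Params                       // assumed launch parameter layout
{
    unsigned int* rayCountBuffer;   // one entry per launch index
    // ... remaining launch parameters (camera, output buffer, etc.)
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__counted()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    unsigned int numRays = 0; // lives in a register while tracing

    // Inside the path tracing loop, increment before every trace:
    //   ++numRays;
    //   optixTrace(...);

    // At the end of the ray generation program, write the count out.
    // This buffer can later be summed on the host or in a CUDA kernel,
    // or visualized directly as a heat map.
    params.rayCountBuffer[idx.y * dim.x + idx.x] = numRays;
}
```

If the counter needs to survive across closest-hit programs (e.g. for rays traced recursively there), carry it in the per-ray payload instead of a local variable.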

If this is in a progressive renderer you could also initialize the counter buffer to zero at the initial sub-frame and add counted rays over multiple sub-frames.

If this is meant to be used in a benchmark, note that the additional counting will slow down performance; to get really accurate results, you should run the exact same launch with and without the counting mechanism.

If performance didn’t matter for the counting, you could also combine all results with atomicAdd() calls to a single counter location.
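That variant would look roughly like this, with params.totalRayCount being an assumed device pointer zeroed before the launch:

```cpp
// Single global counter variant: trivial to read back after the launch,
// but every trace call in every thread contends on one memory location,
// which is why it is only suitable when counting overhead is acceptable.
atomicAdd(params.totalRayCount, 1u); // params.totalRayCount: unsigned int*
optixTrace(/* ...same arguments as before... */);
```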

Thank you @droettger. Great information.