OptiX Time for Launch

Hello OptiX People,

currently I am writing my master's thesis using OptiX in a VR context. Unfortunately the performance is way below what it needs to be to be viable for VR.
I tried tracking down what consumes so much time per frame and found that the actual tracing takes about 5-10 ms. I measured the time for a launch, but even if I delete every line of code inside the entry point program it still needs 3 ms. What am I doing wrong? 3 ms for a basically empty program seems to be too much.

The structure of my program involves many OpenGL interop buffers, only ~5 primitives in implicit representation, an acceleration structure per primitive, and an empty one over all primitives. Two ray types and two entry points are needed, and three launches are performed per frame (left eye, right eye, some laser ray tracing).

So my question: what does OptiX do when calling launch? How can I lower its startup time?

Here is some report from OptiX:

[2][INFO        ] Launch index 1115.

[2][SCENE STAT  ]     Node graph object summary:

[2][SCENE STAT  ]         RTprogram         : 32

[2][SCENE STAT  ]         RTbuffer          : 24

[2][SCENE STAT  ]         RTtexturesampler  : 2

[2][SCENE STAT  ]         RTacceleration    : 6

[2][SCENE STAT  ]         RTgroup           : 1

[2][SCENE STAT  ]         RTgeometrygroup   : 5

[2][SCENE STAT  ]         RTtransform       : 5

[2][SCENE STAT  ]         RTselector        : 0

[2][SCENE STAT  ]         RTgeometryinstance: 5

[2][SCENE STAT  ]         RTgeometry        : 3

[2][SCENE STAT  ]             Total prim: 3

[2][SCENE STAT  ]         RTmaterial        : 6

[2][TIMING      ]     Acceleration update time: 0.0 ms

[2][MEM USAGE   ]     Buffer GPU memory usage:

[2][MEM USAGE   ]     |         Category |  Count |  Total MByte |

[2][MEM USAGE   ]     |           buffer |     15 |        123.9 |

[2][MEM USAGE   ]     |          texture |      2 |          6.0 |

[2][MEM USAGE   ]     |      gfx interop |      3 |         23.2 |

[2][MEM USAGE   ]     |     cuda interop |      0 |          0.0 |

[2][MEM USAGE   ]     |   optix internal |     19 |          0.0 |

[2][MEM USAGE   ]     Buffer host memory usage: 7.0 Mbytes

[2][MEM USAGE   ]     Local memory for all threads (CUDA device: 0): 384.4 MBytes

[1][TIMING      ]     Total launch time: 37.2 ms

What is your system configuration?
OS version, installed GPU(s), display driver version, OptiX version (major.minor.micro), CUDA toolkit version, host compiler version?
How big is your launch size?
What’s your resulting image data format?

“The structure of my program involves many OpenGL interop buffers”
That’s probably part of the problem. Each launch needs to register and unregister them to do interop.
Are you dynamically copying stuff back and forth between the APIs?
Are you sure the 3ms isn’t just the memory copying?

“only ~5 primitives in implicit representation,”
Why do you need 32 programs?

“an acceleration structure per primitive”
Because you’re having transforms over each of them.
Are they each different primitives with different intersection and bounding box programs?

“and an empty one over all primitives.”
If you mean NoAccel, then each ray is tested against each primitive instead of an early out when missing all. Use the same acceleration builder for the root group as for the GeometryGroups.
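As a minimal host-side sketch (OptiX 5.x C++ wrapper; `context` and `rootGroup` are hypothetical names standing in for your own objects), that could look like:

```cpp
// Give the root Group a real builder instead of "NoAccel", matching the
// builder used for the GeometryGroups, so rays that miss everything are
// culled early instead of being tested against every primitive.
optix::Acceleration rootAccel = context->createAcceleration("Trbvh");
rootGroup->setAcceleration(rootAccel);
rootAccel->markDirty(); // build it on the next launch
```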

“2 raytypes and entypoints are needed and 3 launches are performed per frame (left eye, right eye, some laser ray tracing).”
A ray tracer can do arbitrary camera projections, you should be able to render stereo images in a single launch.
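A hedged sketch of what that ray generation program could look like (hypothetical device code, not from the original program; the per-eye variables and the actual ray generation are placeholders):

```cpp
rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
rtDeclareVariable(uint2, launch_dim,   rtLaunchDim, );
// Hypothetical per-eye camera positions set from the host.
rtDeclareVariable(float3, eye_left,  , );
rtDeclareVariable(float3, eye_right, , );

RT_PROGRAM void pinhole_stereo()
{
    // One launch twice as wide as a single eye: the left half renders the
    // left eye, the right half renders the right eye.
    const unsigned int half  = launch_dim.x / 2;
    const bool         right = launch_index.x >= half;
    const float3       eye   = right ? eye_right : eye_left;
    const uint2        pixel = make_uint2(right ? launch_index.x - half
                                                : launch_index.x,
                                          launch_index.y);
    // ... generate the primary ray for 'pixel' from 'eye' and rtTrace() as usual.
}
```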

Since you’re only using 3 Geometry nodes and five GeometryInstances, you’re doing geometry instancing. That means you could share the acceleration structures of the instanced Geometry among the GeometryInstances using them.

If you’re new to OptiX, I’d recommend watching my GTC 2018 OptiX Introduction talk and working through the open source examples. All links here:
https://devtalk.nvidia.com/default/topic/998546/optix/optix-advanced-samples-on-github/

If you use optixIntro_04 as a comparison: it needs about 3 ms in FullHD on a rather fast Quadro P6000 to render and display when dollying out so that basically only the miss shader is invoked.
The launch overhead of an empty scene is around 0.05 ms, so the 3 ms come from the primary rays, accumulating, and copying the float4 data around. That means you’re compute and memory bandwidth limited on your system. Using half4 or bgra8 formats would make that introduction example faster.

System configuration: Windows 10, Visual Studio 2017, GTX980Ti (24.21.13.9744), Xeon E5-1650, OptiX 5.1.0
Launch size: 1628x1809, 1628x1809, 50x50x300
Data format: Byte4 and Float, Byte4 and Float, Float3

I am copying from the OptiX Buffer to an OpenGL texture via a Pixel Unpack Buffer. The time measured did not include the copy.

I am creating new programs for every primitive; I should change this and do instancing with different parameters. The other programs come from {miss, exception, entrypoint 0, entrypoint 1, geometry, materials}.

Yes, one transform per primitive.
All primitives use the same intersection code with different parameters resulting in different bounding boxes, but yes, same program.

Ok, do I then need to mark it dirty every time a transform below it changes?

I implemented the tracing for both eyes in one stage, but it resulted in nearly the same performance, even though one launch is omitted this way.

Does OptiX copy every interop buffer before using it, or does it simply get a pointer to it? I only use these buffers as “output” buffers to pass data to OpenGL. Is there any method to speed up the launch when using the interop buffers?

Try to avoid float3 output buffers.
http://raytracing-docs.nvidia.com/optix/guide/index.html#performance#13001

“Ok, do i then need to mark it dirty every time a transform below it changes?”

Yes, when a transform node changes, all acceleration structures (AS) above that should be marked dirty. Use acceleration properties “refit”, “1” on the root to not rebuild the AS.
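In code that could look like this (hypothetical names, OptiX 5.x C++ wrapper):

```cpp
// Done once at setup: allow cheap refits of the root acceleration structure.
rootAcceleration->setProperty("refit", "1");

// Every time a Transform below the root changes:
transform->setMatrix(false, matrix, nullptr); // new object-to-world matrix
rootAcceleration->markDirty();                // refit instead of full rebuild on next launch
```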

When using OpenGL textures with internalFormat GL_RGBA8, make the user format GL_BGRA and type GL_UNSIGNED_BYTE to hit the fastest path. Maybe use glTexSubImage2D instead.
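A sketch of that fast-path upload, assuming the interop PBO already holds the launch output (`pbo`, `tex`, `width`, `height` are hypothetical names):

```cpp
// Device-to-device blit from the bound PBO into a GL_RGBA8 texture.
// GL_BGRA + GL_UNSIGNED_BYTE matches the internal layout on this path,
// avoiding a driver-side swizzle.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_BGRA, GL_UNSIGNED_BYTE, nullptr); // nullptr = read from bound PBO
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```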

Other than that I would simply try how fast it is to fill all buffers with some data in the three launches and then how fast it is to transfer the data.
Bench that for the image generating and the last launch separately.

Then I would add one primitive at a time and see how that changes the performance.
If your intersection program is heavy, that will possibly slow things down a lot at those resolutions.

“Does OptiX copy every interop buffer before using it?”

That would defeat the idea of interoperability.
But the OpenGL upload to the texture will copy it from the PBO. That’s a device-to-device blit though.

I checked every instance of program creation and carefully evaluated whether it needs to be there. In a standard scene not many programs are saved this way, but in an extreme case it should now scale much better. Also, sharing acceleration structures (and using them at all) gave at least a 4 fps boost, to now ~17 fps! Thank you!

For my buffers I checked and changed the one Float3 buffer to a Float4 buffer.

In my next tests I want to run the measurements you suggested to see how the application scales and how much time the copying of data needs, but I do not expect it to take much.

I got another question concerning the stack size. I read that it can heavily impact the performance. How can I find out how big the stack needs to be? Is there some reporting function?

Also, the performance guide specifies how to align the PerRayData structs, but I am not sure I understood it correctly. Is the alignment of the following structs (biggest to front) the best way?

struct PerRayData_radiance_iterative{
  float3	origin;
  float3	direction;
  float		power;
  int		depth;

  bool		done;
  bool		hit_lens;
};

struct PerRayData_radiance{
  float3	result;
  float		hit_depth;
  float		importance;
  int		depth;
  
  bool		miss;
  bool 		hit_lens;
};

See CUDA alignment rules (Size and Alignment Requirement)
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types

My simple rules to make sure that the CUDA alignments of vector types are fulfilled:

First float4, int4, uint4 (16 bytes aligned),
then float2 (8 bytes aligned)
then all others: float3, float, int, uint (all 4 byte aligned)
and bool either encoded as bits in an uint bitfield
or bool, char, uchar at last (1 byte aligned)

If you use that inside arrays, manually pad that structure at the end to the biggest alignment requirement.

Your structures look ok.

On the stack size, use this to determine the minimal stack size today.
https://devtalk.nvidia.com/default/topic/1004649/?comment=5130084

Search the OptiX forum for “setStackSize” for some more explanations.
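For reference, the stack size is a single per-context setting in OptiX 5.x; a hedged sketch of the bisection approach described in the linked post:

```cpp
// Shrink the per-thread stack until a launch throws a stack overflow
// exception, then add some safety margin. The default is far larger
// than most programs need.
context->setStackSize(2048); // bytes per thread, optix::Context, OptiX 5.x
```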

Thank you for your suggestions! With fixed bounding boxes (I had a bug making them infinite in one direction) and shared acceleration structures the framerate is now at 30-40 fps when looking at geometry from a normal view (dropping when very close and screen filling).

With your tutorial I found out that the stack size could be a quarter of what it was, now at 2000 bytes (not much performance impact).

Adding much more geometry to my scene does impact the performance, but that is not a realistic case; at most there are maybe 10 primitives. So I used my static scene from before for the following measurements.

I did the measurements without looking at the geometry to measure a dry run:

One eye:

  • Rendering complete 7250µs
  • Only moving the camera 22µs
  • Only launch 7200µs
  • Only empty launch 3960µs
  • Only copy to texture 2µs (can this really be true?)

For the empty launch I used the following code for the primary ray casting, otherwise the same scene, and it still needs 3960 µs:

#include <optix.h>

RT_PROGRAM void pinhole_camera(){}

Is this really only initialization? Can I do anything about it?

Did you measure that after some launches?
The very first launch does a lot more work and is not representative.

Do you have VSYNC disabled when benchmarking?

Again, as a feasibility check please try to run my OptiX Introduction examples and see what the performance on your system is when dollying out the scene in optixIntro_04.
I don’t know what your workload is or if you’re doing something wrong and need a known baseline.

That’s a brute force path tracer with a Lambert material. It is set up by default to render paths of length two with all-white materials in a white environment, producing ambient occlusion images, but you can change colors and path length in the GUI.

I don’t have a Maxwell based board at hand right now, but if that is not running at >300 fps in FullHD when only the miss shader is invoked, then the baseline performance of that system is not fast enough for VR ray tracing.
90 Hz means an 11 ms frame budget, so each of your launches should be below 3 ms. Your summed launch dimension is 3.2 times bigger than FullHD. I don’t expect the GTX 980Ti to have the necessary compute power to ray trace that in the given time budget.

The time was measured after about 5 seconds, and the timings were averaged over 50 samples.

Unfortunately I can’t disable V-Sync because the HMD forces it, but for the time measurements this shouldn’t change much (only the final fps). As far as I understand v-sync, it should only change the waiting between the launches.

For the performance part: sorry, I forgot to mention that today’s measurements were made on another computer. The hardware here is a GTX 1080; everything else is the same.

Introduction example 4 (modified with ‘glfwSwapInterval(0)’ to disable vsync) and dollied out so far that only a few pixels of the geometry were visible resulted in 255 fps at 1920x1080.

I also tried implementing the stereo rendering in one pass, and it results in nearly the same performance (within a few hundred µs). So upscaling the buffers and rendering both eyes at once does not help.