OptiX 8: optixTrace() vs optixTraverse()+optixInvoke() performance

Hello,

I observe a systematic performance degradation when optixTrace() is simply replaced with optixTraverse(); optixInvoke() calls (no optixReorder() used). The difference gets bigger when more complex shaders, and more of them, are involved. Is that expected, or should I pay attention to something else when implementing the new approach?

The only change I am making in the code is:

	optixTrace(
		handle,
		ray_origin,
		ray_direction,
		tmin,
		tmax,
		0.0f,                   // rayTime
		OptixVisibilityMask(1),
		OPTIX_RAY_FLAG_NONE,
		RAY_TYPE_RADIANCE,      // SBT offset
		RAY_TYPE_COUNT,         // SBT stride
		RAY_TYPE_RADIANCE,      // miss SBT index
		u0, u1, g0, g1);        // payload registers

replaced with:

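	optixTraverse(
		handle,
		ray_origin,
		ray_direction,
		tmin,
		tmax,
		0.0f,                   // rayTime
		OptixVisibilityMask(1),
		OPTIX_RAY_FLAG_NONE,
		RAY_TYPE_RADIANCE,      // SBT offset
		RAY_TYPE_COUNT,         // SBT stride
		RAY_TYPE_RADIANCE,      // miss SBT index
		u0, u1, g0, g1);        // payload registers
	// Run the closest-hit or miss program stored in the current hit object.
	optixInvoke(u0, u1, g0, g1);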

The code is running on an RTX 4090 with driver 552.12.

Could you quantify that performance degradation with absolute numbers please?
What differences in performance with how many shaders of what complexity are we talking about?

I briefly tested that with my MDL_renderer example and the scene description scene_mdl_vMaterials.txt, and I see only a small difference between optixTrace and optixTraverse/optixInvoke inside the non-SER code path of the integrator: 113.25 vs. 112.6 samples per second on an RTX 6000 Ada running Windows 10 with 545.84 drivers.
(Same as used here: https://forums.developer.nvidia.com/t/optix-advanced-samples-on-github/48410/14 )
I’m not using a lot of different hit records. That renderer is configuring materials with a few similar hit programs and the rest is done with direct callable programs.

So I would expect a minor difference, maybe due to some hit object data handling.

I would need to test R550 drivers next week if you see anything worse.

Have you checked with Nsight Compute if there is any obvious difference in behavior between the two modes?

If you’re not using optixReorder, there is not much incentive to replace optixTrace with optixTraverse/optixInvoke just because you can.
There are some cases where optixTraverse can be used instead of optixTrace, like the quick shadow ray test shown inside the optixPathTracer example, which also does SER.
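
For illustration, a shadow ray test of that kind could look roughly like the sketch below. It is written from memory along the lines of the optixPathTracer sample, not copied from it. The function name, the RAY_TYPE_COUNT value, and the choice of ray flags are assumptions (in particular, disabling any-hit only works if there is no cutout opacity).

```cuda
#include <optix.h>

// Assumption: number of ray types in the SBT, used as the SBT stride
// (same role as RAY_TYPE_COUNT in the optixTrace call at the top of the thread).
constexpr unsigned int RAY_TYPE_COUNT = 2;

// Returns true when any geometry blocks the ray inside [tmin, tmax].
static __forceinline__ __device__ bool isOccluded(
    OptixTraversableHandle handle,
    float3                 origin,
    float3                 direction,
    float                  tmin,
    float                  tmax)
{
    optixTraverse(
        handle, origin, direction, tmin, tmax,
        0.0f,                      // rayTime
        OptixVisibilityMask(1),
        // Assumption: no cutout opacity, so any-hit programs can be skipped.
        OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT | OPTIX_RAY_FLAG_DISABLE_ANYHIT,
        0,                         // SBT offset
        RAY_TYPE_COUNT,            // SBT stride
        0);                        // miss SBT index
    // No optixInvoke() here: neither a closest-hit nor a miss program runs,
    // the result is read directly from the current hit object.
    return optixHitObjectIsHit();
}
```

The point of this pattern is that the visibility result comes from the hit object alone, so no payload registers and no shader invocation are needed for the shadow ray.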

OK… I think I have found the reason. In the more complex scenes I am also casting occlusion rays, and these were still done with optixTrace() while optixTraverse(); optixInvoke() was used for radiance. Mixing these two approaches gives me slowdown of ~20%. When I switch everywhere to optixTraverse(); optixInvoke() the slowdown is ~1%.
This is anyway important to know, since I’d like to cast 2-3 radiance rays from the primary hit position, re-using manually created hit object. At the moment I do this with optixTrace(), which repeats the traversal of the primary ray each time. So I know that the change also requires changing the part with the occlusion rays.

I started yesterday with testing SER, but got significantly worse performance and started investigating… that was the aim of running optixTraverse(); optixInvoke() without SER.

Mixing these two approaches gives me slowdown of ~20%.

Now that is actually weird. That is not what I have seen above.

I only changed the radiance ray shot inside the ray generation program from optixTrace to optixTraverse/Invoke:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/MDL_renderer/shaders/raygeneration.cu#L176
I did not change the shadow ray shot inside the hit programs:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/MDL_renderer/shaders/hit.cu#L454

This is anyway important to know, since I’d like to cast 2-3 radiance rays from the primary hit position

From which program domain?
I’m not doing that. Radiance rays are only shot from the ray generation program. There is no recursion in my path tracers at all.
The ray generation program is also where the optixReorder belongs.
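
As a rough sketch of what that looks like in OptiX 8, the radiance ray from the first post with SER added would become traverse, reorder, invoke. The wrapper function and the ray type enum are hypothetical and only mirror the names used in the snippets at the top of the thread. Whether the reorder actually pays off depends on the scene and the shaders.

```cuda
#include <optix.h>

// Assumption: ray type layout mirroring the names used earlier in the thread.
enum RayType { RAY_TYPE_RADIANCE = 0, RAY_TYPE_OCCLUSION = 1, RAY_TYPE_COUNT = 2 };

// Meant to be called from the ray generation program only,
// since optixReorder() is not available in other program types.
static __forceinline__ __device__ void traceRadianceReordered(
    OptixTraversableHandle handle,
    float3                 origin,
    float3                 direction,
    float                  tmin,
    float                  tmax,
    unsigned int&          u0,
    unsigned int&          u1,
    unsigned int&          g0,
    unsigned int&          g1)
{
    // 1. Traversal only: fills the current hit object, runs no CH/miss program.
    optixTraverse(
        handle, origin, direction, tmin, tmax,
        0.0f,                      // rayTime
        OptixVisibilityMask(1),
        OPTIX_RAY_FLAG_NONE,
        RAY_TYPE_RADIANCE,         // SBT offset
        RAY_TYPE_COUNT,            // SBT stride
        RAY_TYPE_RADIANCE,         // miss SBT index
        u0, u1, g0, g1);

    // 2. Reorder threads based on the current hit object (no extra coherence hint).
    optixReorder();

    // 3. Run the closest-hit or miss program recorded in the hit object.
    optixInvoke(u0, u1, g0, g1);
}
```

Without step 2 this is exactly the optixTraverse(); optixInvoke() replacement from the first post.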

re-using manually created hit object.

Why though? There is an implicit hit object after optixTraverse. You’re saving that into a manually generated hit object to be able to reuse it for the following optixTraverse calls? (Code would be helpful.)

(Sidenote issue about hit objects and transformations:
https://forums.developer.nvidia.com/t/understanding-optixtransformnormalfromobjecttoworldspace/285169 )

I started yesterday with testing SER, but got significantly worse performance and started investigating

It can still be that SER is a loss when you’re using too much local memory.
Please read the very first link I posted above with performance experiments I had done.