One possibility is to ray trace your object's UV map instead of your object. You can store your positions in the UV map, then compute your lighting or whatever else from the hit positions:
0. Do the general scene graph and acceleration structure setup.
1. Create an OptiX buffer A and store your object's UV map with positions (world positions over UV space) into it. This buffer could be the same size as your output_buffer.
2. In ray generation (your camera function), read from buffer A. If there is valid data in the buffer, read the position and generate a ray (from your view to that position). Edit: A "view to position" ray may get blocked, which might not be what you need. As Detlef said, you could set the position as your ray origin.
3. Everything else is no different from a common ray tracer.
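A minimal CPU-side sketch of the ray generation logic in step 2, assuming buffer A stores one world position per texel with the w component used as a validity flag (all names and types here are illustrative, not OptiX API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of reading buffer A in the ray generation program.
struct Float4 { float x, y, z, w; };
struct Ray { Float4 origin; bool valid; };

// For each launch index, read buffer A; if the texel holds valid data
// (w > 0 means the texel is covered by geometry), start the ray at that
// world position, as suggested above, rather than tracing from the
// camera toward it.
Ray makeRayForTexel(const std::vector<Float4>& bufferA, std::size_t launchIndex)
{
    Ray ray{};
    const Float4& p = bufferA[launchIndex];
    if (p.w > 0.0f) {          // valid texel: covered by geometry
        ray.origin = p;        // world position becomes the ray origin
        ray.valid  = true;     // shoot a ray for this texel
    } else {
        ray.valid = false;     // uncovered texel: skip, leave it black
    }
    return ray;
}
```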
Yes, you have full control over which rays to shoot in OptiX.
Though if you’re a beginner with OptiX or ray tracing in general, there is a steep learning curve ahead in the following pseudo-algorithm description.
There are basically only two main things to solve: finding the world position per texel center (once), and integrating the incoming light, ideally with a progressive algorithm (many OptiX launch() calls with the texture size).
Let’s start simple and assume you have a single texture map which is mapped to some geometry and you want to bake incoming light into the individual texels of that map.
That means you have a grid of texels you want to write to; that’s your output buffer and your launch size.
Then you have some geometry with texture coordinates which map this texture uniquely, meaning each texel is used only once on that geometry. Let’s say the geometry is built from triangles.
If you want to do this perfectly per texel, you would need to calculate where on your geometry each texel center lies in world coordinates. That’s possible because you know the texture size, the triangle geometry, and the texture coordinates on the triangles.
The simplest approach has quadratic complexity, two loops, per texel, per triangle: if the texel’s UV coordinate lies inside the triangle’s texture coordinates, calculate the world position at that UV coordinate by interpolating the triangle’s vertex coordinates, then break. No duplicates possible!
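The inner test-and-interpolate step of that quadratic loop can be sketched for a single texel/triangle pair like this (a minimal model with illustrative names, using barycentric coordinates in UV space):

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Compute barycentric coordinates of uv inside the triangle's texture
// coordinates (t0, t1, t2). Returns false if the texel center lies
// outside the triangle (or the UV triangle is degenerate).
bool barycentrics(Vec2 uv, Vec2 t0, Vec2 t1, Vec2 t2,
                  float& b0, float& b1, float& b2)
{
    const float d = (t1.x - t0.x) * (t2.y - t0.y) - (t2.x - t0.x) * (t1.y - t0.y);
    if (std::fabs(d) < 1e-12f) return false;          // degenerate UV triangle
    b1 = ((uv.x - t0.x) * (t2.y - t0.y) - (t2.x - t0.x) * (uv.y - t0.y)) / d;
    b2 = ((t1.x - t0.x) * (uv.y - t0.y) - (uv.x - t0.x) * (t1.y - t0.y)) / d;
    b0 = 1.0f - b1 - b2;
    return b0 >= 0.0f && b1 >= 0.0f && b2 >= 0.0f;    // inside test
}

// Interpolate the triangle's world-space vertices (v0, v1, v2) with the
// barycentrics found in UV space: that is the texel's world position.
Vec3 interpolate(Vec3 v0, Vec3 v1, Vec3 v2, float b0, float b1, float b2)
{
    return { b0 * v0.x + b1 * v1.x + b2 * v2.x,
             b0 * v0.y + b1 * v1.y + b2 * v2.y,
             b0 * v0.z + b1 * v1.z + b2 * v2.z };
}
```

The outer loops just iterate texel centers and triangles and break on the first hit, since each texel maps to at most one triangle in a unique UV layout.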
It should be more efficient to go over all triangles and find the texels they touch. Texels which are not covered would need to be skipped during baking. They would remain black and could later be filled with some color of the surrounding areas to allow filtering of the texture without artifacts.
You do this only once in a pre-process. Those world coordinates are your ray origins. You can write these into a buffer you read in your ray generation program to initialize the OptiX Ray.
For each ray origin you also need to figure out the face normal at that point on the triangle. That’s the center of the hemisphere into which you need to shoot rays to integrate the incoming light at that texel.
(Possible would also be the shading normal here, for potentially smoother results, but then you would need to make sure ray directions over that hemisphere do not penetrate to the actual surface.)
Then you write a progressive renderer which integrates the incoming light at that texel with some algorithm shooting many rays into the upper hemisphere over that texel, defined by ray origin and triangle face normal, and accumulate the result into your output buffer by launch index (means per texel).
Normally that’s done with ambient occlusion to capture only the shadows because that’s a view independent method. Baking full global illumination isn’t going to look quite right when rendering with that, because many lighting effects are view dependent.
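The per-texel accumulation described above boils down to a running mean over all samples shot so far, which is what an accumulation buffer indexed by launch index performs each launch. A minimal per-channel sketch:

```cpp
#include <cassert>
#include <cmath>

// Progressive accumulation for one texel: acc holds the running mean of
// all samples so far, n is the number of samples already accumulated.
// One launch contributes one new sample per texel.
float accumulate(float acc, float sample, unsigned int n)
{
    // new mean = old mean + (sample - old mean) / (n + 1)
    return acc + (sample - acc) / static_cast<float>(n + 1);
}
```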
Following your suggestion I managed to write my first lightmapper :) There are still some issues to resolve, like expanding UV edges (otherwise seams look fugly). But overall I’m surprised how good it looks. For sure the new AI denoiser adds a lot and works great with GI lightmapping.
Currently my ray shoots from world_position + normal, with the direction of the inverted normal. Is this the correct approach? When I was shooting from world_position in the normal direction, I got an ortho view from each face (which makes sense :))
Once again, thank you very much for the help! So far, I really enjoy working with OptiX!
No. That would capture incoming light at a point which is not on your triangle and from the wrong directions. That point might even be inside some other geometry and you’d get black when doing that.
That is a correct direction for one of the hundreds or thousands of rays you would need to shoot per texel to integrate the incoming light! That’s why I said this needs to be a progressive algorithm.
The face normals of a triangle are normally defined on their front face, which in turn is defined by the triangle winding. In a right-handed coordinate system with counter-clockwise winding with float3 triangle vertices (v0, v1, v2) that would be
float3 face_normal = normalize(cross(v1 - v0, v2 - v0));
Now together with the world position (== ray origin on the triangle) that face normal defines the center of the upper hemisphere over that point.
That whole hemisphere needs to be integrated to capture the incoming light at that world position on that triangle.
That is done by shooting many rays (hundreds to thousands) from that single ray origin into its upper hemisphere with a cosine weighted distribution of ray directions. Means one of the ray directions can be that face normal. You need some random number generator to sample these directions.
Examples calculating these directions can be found inside the OptiX SDK, for example inside the optixPathTracer. Search for the code calling into cosine_sample_hemisphere, which works in local coordinates and gets transformed into world coordinates with an ortho-normal basis defined by your face normal:
Something like this gives you such a direction:
float z1 = rnd(current_prd.seed); // Random values in [0, 1), different per ray! seed needs to be different per launch.
float z2 = rnd(current_prd.seed);
float3 direction; // This will contain your ray direction in the end.
cosine_sample_hemisphere(z1, z2, direction); // Cosine weight hemisphere direction (in local coordinates).
optix::Onb onb(face_normal); // Some ortho-normal basis with z-axis == face_normal (in world coordinates).
onb.inverse_transform(direction); // Transform local direction into hemisphere direction above triangle (in world coordinates).
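A self-contained version of the same idea (standard cosine-weighted hemisphere sampling plus an ortho-normal basis), without the OptiX SDK helpers, so the math is visible; the helper names here are my own, not from the SDK:

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };

static V3 cross3(V3 a, V3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static float dot3(V3 a, V3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static V3 normalize3(V3 v) { float l = std::sqrt(dot3(v, v)); return { v.x/l, v.y/l, v.z/l }; }

// Map two uniform random numbers in [0, 1) to a cosine-weighted
// direction in the local frame where +z is the surface normal.
V3 cosineSampleHemisphere(float z1, float z2)
{
    const float r   = std::sqrt(z1);
    const float phi = 2.0f * 3.14159265358979f * z2;
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - z1) };
}

// Transform the local direction into world space with an ortho-normal
// basis whose z-axis is the (normalized) face normal.
V3 toWorld(V3 local, V3 normal)
{
    const V3 n = normalize3(normal);
    // Pick any vector not parallel to n to build the tangent.
    const V3 up = (std::fabs(n.z) < 0.999f) ? V3{0.0f, 0.0f, 1.0f} : V3{1.0f, 0.0f, 0.0f};
    const V3 t = normalize3(cross3(up, n));
    const V3 b = cross3(n, t);
    return { local.x*t.x + local.y*b.x + local.z*n.x,
             local.x*t.y + local.y*b.y + local.z*n.y,
             local.x*t.z + local.y*b.z + local.z*n.z };
}
```

Note that z1 == 0 yields the local +z direction, i.e. the face normal itself in world space, matching the statement above that the face normal can be one of the ray directions.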
It’s pretty good, but there is still some room for improvement. When I forced OptiX/CUDA to use one Tesla P100, one frame took 265 ms. So with 4x P100, one frame should take around 66 ms. Currently it takes 78 ms.
I’ve noticed that everything performs much faster in a multi-GPU environment when I use RT_BUFFER_GPU_LOCAL. However, understandably, for the output buffer which I use to save the lightmap, it causes artifacts (black strips over the rendered image): https://i.imgur.com/eK5eHKj.png
Is it possible to wait a little bit and get the output_buffer in full when it has the RT_BUFFER_GPU_LOCAL type?
If yes, then I would get 67 ms per frame, which gives me 23x performance against the base and an almost perfect scalability result of 3.97x for 4 devices.
Another topic I need to investigate is progressive launch. For that I probably need to create a progressive buffer bound to the output buffer and poll its values. Or is there some kind of progressive update callback?
Ok, I’ve found some resources on progressive streaming. It could improve performance further.
Scaling of multiple GPUs per OptiX context is not linear with the number of GPUs because rendering happens to pinned memory via the PCI-E bus from all of the boards which comes with some overhead. Compare to the scaling ratio on two boards.
If you do not have access to an NVIDIA VCA system with a matching OptiX server implementation, there is no benefit of using the progressive API inside OptiX. https://devtalk.nvidia.com/default/topic/883964/
The optixProgressiveVCA example demonstrates how the progressive API can be used.
I added a switch to my renderers to not present the individual accumulation steps done per launch, but only the launches in the first half second and then once every second, unless the accumulation is restarted. The main benefit, especially for multi-GPU setups, is saving the PCI-E bandwidth of the texture transfer for display, since no OpenGL interop is possible then. The renderer is accumulating anyway, and that way you can also see incremental improvements better if they are displayed only once per second.
I wonder what kind of results I would get from splitting my 1024x1024 output buffer into separate render tasks assigned to separate GPUs. When I render the image at half its resolution, the time goes down linearly.
So if I create 4 render tasks of 512x512 size, I should get a linear improvement.
PS. 10 ms seems like quite a big number for PCI-E bus overhead on a relatively small task (1024x1024).
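The manual partitioning idea above could be sketched like this, splitting one launch into equal horizontal bands, one per device; this is only a hypothetical model of the bookkeeping (OptiX itself handles its own load balancing across the devices of one context):

```cpp
#include <cassert>
#include <vector>

// One band of the output buffer, to be rendered as its own task.
struct Tile { unsigned int yOffset, width, height; };

// Split a width x height launch into deviceCount horizontal bands.
std::vector<Tile> splitLaunch(unsigned int width, unsigned int height,
                              unsigned int deviceCount)
{
    std::vector<Tile> tiles;
    const unsigned int band = height / deviceCount;   // assumes height divisible
    for (unsigned int i = 0; i < deviceCount; ++i)
        tiles.push_back({ i * band, width, band });
    return tiles;
}
```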
That’s all on application side and pretty basic stuff.
I use a configuration text file with a lot of settings instead of command line parameters to control my renderer setup. For example which devices to use, if OpenGL interop should be used or not, camera settings, etc. One of the options is a boolean flag controlling if the output buffer should be displayed on each launch or in some time interval, one second in my case.
I use a timer class, a display() function which calls into a render() function and does the final texture blit, renders the GUI, and does a SwapBuffer.
The render() function does the OptiX launch() and then checks if the timer has passed beyond a threshold which should trigger an update of the OpenGL texture used to display the resulting image.
Only if the time has passed does the render function copy the OptiX output buffer into the texture, advance the time threshold at which the next update should happen by one second, and, since I have a timer running anyway, print some performance numbers as well. The display routine then has a new image to work with.
Means the read of the output buffer happens only once a second unless anything required a restart of the accumulation (camera moved, material parameters changed, etc.). Then the render routine updates the texture immediately (actually I do that for the first half second to always see some accumulation) and advances the time threshold for the next present again. And so forth.
I also have slightly different behaviors for interactive rendering and benchmarking as well.
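The present-throttling logic described above can be reduced to a small time-threshold check; a minimal sketch with invented names, with times in seconds (the real code would read them from the timer class):

```cpp
#include <cassert>

// Show every launch for the first half second, then only once per
// second; reset startTime/nextPresent when the accumulation restarts.
struct PresentState {
    double startTime   = 0.0;  // when accumulation (re)started
    double nextPresent = 0.0;  // next time the texture may be updated
};

bool shouldPresent(PresentState& s, double now)
{
    if (now - s.startTime < 0.5) return true;    // first half second: always present
    if (now >= s.nextPresent) {                  // threshold passed:
        s.nextPresent = now + 1.0;               // advance by one second
        return true;                             // copy output buffer to texture
    }
    return false;                                // skip the expensive readback
}
```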
I’ve improved the performance of the path tracer further. However, this only increased the gap between single- and multi-GPU performance.
In the end I did a simple test where I set one output buffer and multiple gpu_local or input ones, and in the ray generation program I just map the output buffer to a solid color and nothing else. This alone consumes 15 ms on a 4-GPU setup and 0 ms on a single GPU.
I’ve found a way to improve it. Instead of using float4 as the output buffer, I use BYTE4. But this has other limitations, such as lack of precision, and if I want tone mapping, I need to do it before I populate the output buffer.
So with BYTE4 as the output_buffer, filling the buffer with a solid color on the 4-GPU setup takes 3 ms now.
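The quantization mentioned above (tone mapping before populating the BYTE4 buffer) might look like this; the gamma 2.2 curve here is just one example of a tone map done inside the launch:

```cpp
#include <cassert>
#include <cmath>

// BYTE4-style output element, matching an 8-bit-per-channel buffer.
struct UChar4 { unsigned char x, y, z, w; };

// Apply a simple gamma 2.2 tone map, clamp to [0, 1], and quantize to
// 8 bits. Any post-processing has to happen here, before the write,
// since the precision is gone once the value is stored as a byte.
static unsigned char quantize(float c)
{
    const float g = std::pow(c < 0.0f ? 0.0f : c, 1.0f / 2.2f);  // gamma before quantizing
    const float v = g > 1.0f ? 1.0f : g;                          // clamp to [0, 1]
    return static_cast<unsigned char>(v * 255.0f + 0.5f);         // round to nearest byte
}

UChar4 toByte4(float r, float g, float b)
{
    return { quantize(r), quantize(g), quantize(b), 255 };
}
```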
However, I still believe better job management/orchestration could fix this issue at the OptiX level. I could simply split my 1024 by 4, do the renders separately, and get an almost linear increase in performance on the multi-GPU setup. The alternative would be to do more in each launch at the cost of interactivity, which I don’t need that much and could throttle up and down.
Will check above two options and come back with results.
Checked heavier path tracing cycles, and with 8x8 samples per launch the 4-GPU setup shined and got an almost linear increase. One frame took 2375 ms on one GPU and 607 ms on four. I’m quite happy with that, as it’s not far from the perfect 593 ms.
I need to correct my previous statement here.
It’s actually valid to use RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL buffers for accumulation over multiple GPUs because the OptiX load balancer is assigning static working sets.
How it does that is still abstracted by OptiX to be able to develop different work distributions in the future.
Just the final output needs to happen into a separate buffer which can be read on the host.
Yes, I went with the same approach: a combination of RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL for accumulation and a slim BYTE4 output buffer for display. This has the limitation of not being able to do post-processing outside the launch, but I can live with it for now.
In the future, it would be nice to be able to fetch results, even with a significant delay, from all GPUs when using RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL.