I have been trying to implement an efficient rasterizer for the past few months. I tried optix but the results were not very satisfactory.
So this is basically a follow-up to your previous post, which already addressed debug vs. release performance.
Could you please quantify your results again?
What performance do you require, and what performance do you achieve today?
I assume you’re still using the RTX 3060?
Do you have an opportunity to test your application on a faster GPU instead? (Higher-end board and/or newer GPU generation.)
I want to know if it’s possible to make optix lighter in order to reduce the time of rasterization. I used optix-8.0.0/SDK/optixTriangle code with GAS acceleration.
Without knowing what your source code and geometry data look like, there isn’t much I can help with at this point.
100,000 triangles is not much for either a rasterizer or a raytracer.
Raytracers excel at high triangle counts because of the spatial acceleration structure, and they allow ordered rendering because each ray finds its closest hit. They scale with ray count (the more rays, the slower).
Rasterizers scale with triangle count on the vertex engine (the more triangles, the slower), and with fragment count on the raster engine (the more fragments, the slower).
If all you need is to rasterize 100,000 triangles into a huge image without depth information or expensive shading, have you implemented that with a rasterizer API (e.g. OpenGL, Vulkan, DX12) before?
That would also require multiple tiles because GPUs have an upper limit on the 2D image resolution as well, e.g. 16384 x 16384 (2^28 pixels), which would need 80 tiles to render 20 * 2^30 pixels.
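To make the tile arithmetic above explicit, here is a minimal sketch. The function name `tiles_needed` is made up for illustration, and it counts tiles by pixel area only; the real tile layout depends on the actual width and height of your image.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of square tiles with side length `tile_side`
// needed to cover `total_pixels` by area, rounded up.
std::uint64_t tiles_needed(std::uint64_t total_pixels, std::uint64_t tile_side)
{
    assert(tile_side > 0);
    const std::uint64_t pixels_per_tile = tile_side * tile_side;
    // Integer ceiling division.
    return (total_pixels + pixels_per_tile - 1) / pixels_per_tile;
}
```

With the numbers from above, `tiles_needed(20ull << 30, 16384)` gives 80, because 20 * 2^30 / 2^28 = 80.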
I cannot say which method would be faster without implementing both.
Tracing only primary rays with no shading is basically the maximum performance you can get from a raytracer. At low triangle counts that will reach the maximum rays/second the hardware can handle.
What is the rays/second you currently achieve?
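In case it helps to report that number, here is a small sketch of how I would compute it. It assumes one primary ray per pixel and that `launch_ms` is the measured time of the launch alone (e.g. taken with CUDA events around it, excluding the acceleration structure build and the device-to-host copy). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: rays per second for a launch that shoots one primary
// ray per pixel. `launch_ms` is the measured launch time in milliseconds.
double rays_per_second(std::uint64_t width, std::uint64_t height, double launch_ms)
{
    assert(launch_ms > 0.0);
    const double rays = static_cast<double>(width) * static_cast<double>(height);
    return rays * 1000.0 / launch_ms;
}
```

For example, a 16384 x 16384 launch is 2^28 rays, so the measured milliseconds for one tile translate directly into a rays/second figure that can be compared between GPUs.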
Use the OPTIX_BUILD_FLAG_PREFER_FAST_TRACE flag for the acceleration structure build.
If the result is simply a boolean indicating whether a ray hit any triangle, there isn’t even a need to encode that into more than a single byte. Even single bits would be possible but would require atomics to write. Either way, that would reduce the amount of memory you would need to copy from the device to the host when needed.
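To illustrate the single-bit variant, here is a host-side C++ sketch of the packing scheme. It is not OptiX device code: the class `HitMask` and its methods are made up for illustration, and `std::atomic::fetch_or` stands in for what would be an `atomicOr` on a 32-bit word in the device program. The atomic is needed because 32 rays share one word.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hit mask: one bit per ray, packed into 32-bit words.
class HitMask
{
public:
    explicit HitMask(std::size_t ray_count)
        : words_((ray_count + 31) / 32) {} // words start out zeroed

    // Set the bit for this ray. Atomic, since other rays write the same word.
    void mark_hit(std::size_t ray_index)
    {
        const std::uint32_t bit = 1u << (ray_index % 32);
        words_[ray_index / 32].fetch_or(bit, std::memory_order_relaxed);
    }

    bool is_hit(std::size_t ray_index) const
    {
        const std::uint32_t word =
            words_[ray_index / 32].load(std::memory_order_relaxed);
        return ((word >> (ray_index % 32)) & 1u) != 0;
    }

    // Number of bytes that would have to be copied to the host.
    std::size_t size_bytes() const { return words_.size() * sizeof(std::uint32_t); }

private:
    std::vector<std::atomic<std::uint32_t>> words_;
};
```

For a 16384 x 16384 tile this is 32 MiB per tile instead of 256 MiB with one byte per ray. Whether the atomics cost more than the smaller copy saves is something you would need to measure.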