I need to use shared memory and hardware-accelerated ray tracing, and setting up DirectX and using interoperability is too much work. (I’ve also been using CUB for reduction, scan, etc. operations in that kernel; I would need to port those to a compute shader as well.) I’m currently using custom software ray tracing, and I’ve been wanting to check how much performance I’m losing.
OptiX is a single-threaded programming model and does not currently support shared memory, because unlike CUDA, your OptiX threads are not guaranteed to stay in the same block, and therefore the same shared memory, during the course of their execution. This wouldn’t change even if OptiX had an inline trace call. (But there’s no reason you couldn’t use a single, extremely minimal hit program in OptiX that returns only minimal payload info, such as the hit-t value, to approximate an inline trace call if you want one for other reasons.)
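To illustrate, here is a minimal sketch of such a program pair plus a raygen-side wrapper, assuming OptiX 7 conventions; the program names and the SBT offset/stride/miss index are placeholder assumptions, not from any specific sample:

```cpp
#include <optix.h>

// Closest-hit: write only the hit-t into payload register 0, nothing else.
extern "C" __global__ void __closesthit__minimal()
{
    optixSetPayload_0(__float_as_uint(optixGetRayTmax()));
}

// Miss: write a sentinel so the caller can detect "no hit".
extern "C" __global__ void __miss__minimal()
{
    optixSetPayload_0(__float_as_uint(-1.0f));
}

// Raygen-side helper: returns the hit-t, or -1.0f on miss.
static __forceinline__ __device__ float traceHitT(
    OptixTraversableHandle handle, float3 origin, float3 direction,
    float tmin, float tmax)
{
    unsigned int p0 = 0;
    optixTrace(handle, origin, direction, tmin, tmax,
               0.0f,                      // rayTime
               OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_NONE,
               0, 1, 0,                   // SBT offset/stride/miss index: assumed layout
               p0);
    return __uint_as_float(p0);
}
```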
Mixing OptiX launches and CUDA launches in the same CUDA stream is legal and valid, so you can continue running CUB reductions before or after an OptiX launch, for example. Depending on your requirements, it sounds like the upcoming Shader Execution Reordering may be a way to get what you need out of OptiX and compare it to your software ray tracing kernel?
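For example, a reduction and a launch can be enqueued back to back on one stream; this is only a sketch, and the handles below (pipeline, SBT, params, buffers) are assumed to be created elsewhere:

```cpp
#include <cub/cub.cuh>
#include <optix.h>
#include <optix_stubs.h>

// Sketch: a CUB reduction followed by an OptiX launch on the same stream.
// The stream preserves ordering, so no extra synchronization is needed
// between the two.
void runFrame(cudaStream_t stream,
              OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
              CUdeviceptr d_params, size_t paramsSize,
              const float* d_in, float* d_sum, int numItems,
              void* d_temp, size_t tempBytes,
              unsigned int width, unsigned int height)
{
    // Enqueue the CUB reduction on the stream...
    cub::DeviceReduce::Sum(d_temp, tempBytes, d_in, d_sum, numItems, stream);

    // ...then the OptiX launch on the same stream; it will not start
    // until the reduction has finished.
    optixLaunch(pipeline, stream, d_params, paramsSize, &sbt,
                width, height, /*depth=*/1);
}
```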
Inline ray tracing is a must for path connection algorithms; it is very helpful. I can’t imagine how complex they would be without it. Without inline RT, I have to use a wavefront scheme everywhere and copy and pass many intermediate buffers, and I can’t use any shared memory or block sync. Please give inline ray tracing some consideration.
I’m not sure I understand what you mean. We currently provide an OptiX SDK sample called “optixPathTracer” that has a path connection algorithm, is fairly simple, and uses neither inline ray tracing nor a wavefront scheme. Why do you feel forced to use a wavefront? Can you elaborate on what your needs & blockers with OptiX are?
Here’s an example. In a conventional path-reuse algorithm, you do MIS when reusing path segments sampled by neighboring pixels, which means you need to compute a normalization factor over a block of pixels for each reused sample (path). It’s natural to amortize that cost by reusing each sample for the entire block at the same time, so shared memory and block sync can be used to compute the weight. Maybe inline ray tracing is not very efficient performance-wise, but it lets you implement complex algorithms quickly, and that makes researchers happy. If inline ray tracing were available, I could do the shadow-ray tests and compute the weightSum in the same shader when connecting paths sampled by other pixels to the current pixel (the visibility test must succeed before you can reuse a sample). Without it, I can only store the visibility results in a buffer and read that buffer back in a subsequent final-gathering compute shader.
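To make the pattern concrete, here is a rough CUDA sketch of what I mean; every name in it (traceVisibility, misWeight, etc.) is a hypothetical placeholder, and the visibility test is exactly the part that inline ray tracing would replace:

```cpp
// Hedged sketch: one block cooperatively computes the MIS normalization
// (weightSum) for a sample shared by the whole block, using shared memory
// and a block-wide sync. Assumes a launch with 256 threads per block.

// Hypothetical placeholder: would be a software ray trace here, or a read
// from a visibility buffer filled by a prior OptiX launch; with inline
// ray tracing it would be a real trace call.
__device__ bool traceVisibility(int /*pixel*/, int /*samplePixel*/) { return true; }

// Hypothetical placeholder for the path-reuse MIS weight term.
__device__ float misWeight(int /*pixel*/, int /*samplePixel*/) { return 1.0f; }

__global__ void connectSharedSample(float* outWeightSum)
{
    __shared__ float s_weight[256];

    const int tid         = threadIdx.x;
    const int pixel       = blockIdx.x * blockDim.x + tid;
    const int samplePixel = blockIdx.x * blockDim.x; // block reuses one sample

    // A pixel contributes to the normalization factor only if the shadow
    // ray between it and the sample's path vertex is unoccluded.
    const bool visible = traceVisibility(pixel, samplePixel);
    s_weight[tid] = visible ? misWeight(pixel, samplePixel) : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory to get the block's weightSum.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s_weight[tid] += s_weight[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        outWeightSum[blockIdx.x] = s_weight[0];
}
```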