I read the OptiX programming guide and it says barriers are not allowed in the input PTX code, but we need them quite badly. Is there a way to get the same effect with some workaround?
In our work, we want all threads to reach the same point and wait, which is exactly what a thread barrier does in CUDA. I tried to implement something like the following as a test:
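(The original pseudo code did not survive, so here is a minimal sketch of the kind of test raygen program described, assuming atomic counters exposed through the launch params; names like `global_count` are placeholders, not the original code.)

```cpp
// Sketch of the barrier test described below (illustrative, not the original code).
#include <optix.h>

struct Params
{
    unsigned int* global_count;          // number of threads currently "alive"
    unsigned int* global_waiting_count;  // number of threads that reached the wait
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__barrier_test()
{
    // Thread has started: count it.
    atomicAdd( params.global_count, 1u );

    // Reached the "waiting" point: count it, then spin as a stand-in for a barrier.
    atomicAdd( params.global_waiting_count, 1u );
    for( volatile unsigned long long i = 0; i < 1000000000ull; ++i )
        ;  // long (or effectively infinite) busy-wait

    // Thread is done: uncount it.
    atomicSub( params.global_count, 1u );
}
```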
In this example, every thread adds 1 to a global thread count when it starts, so I can see how many threads have been executed, and subtracts 1 from the count when it reaches the end. I also keep another count that is incremented when the code reaches the long waiting function (the “waiting count” in the pseudo code). However, the numbers are inconsistent: even if I make the waiting loop infinite, global_count and global_waiting_count don’t match. I can understand that OptiX essentially executes threads asynchronously, and when too many threads are running an infinite loop the compute resources (warps, probably) become fully occupied. If I could replace the loop with a barrier, this kernel should run as expected.
Is it possible anyway?
Typically, barriers in CUDA are used to synchronize at the warp or block level; that’s what the OptiX programming guide is referring to. In order to synchronize all threads, meaning sync at the kernel level, you are allowed to use CUDA stream features and host-side barriers with OptiX, such as cudaDeviceSynchronize() or cudaStreamSynchronize().
Why not break this kernel into several separate kernel launches? Just issuing them on the same CUDA stream will cause them to synchronize exactly in the way your code is written: kernels on the same stream execute in order, as if there were an implicit cudaStreamSynchronize() between them. If you want any other behavior, like allowing several of the intermediate kernels to run in parallel, you can use multiple CUDA streams, with stream events for synchronization.
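For illustration, a rough host-side sketch of what that looks like with two OptiX launches on one stream (pipeline handles, params pointer, and sizes are placeholders):

```cpp
// Two launches on the same stream execute in order; the host waits at the end.
#include <optix.h>
#include <optix_stubs.h>
#include <cuda_runtime.h>

void renderOnePass( OptixPipeline pipeline1, OptixPipeline pipeline2,
                    CUdeviceptr d_params, size_t paramsSize,
                    const OptixShaderBindingTable* sbt,
                    unsigned int width, unsigned int height, CUstream stream )
{
    // Both launches go to the same stream, so the second one will not start
    // until the first has finished -- the implicit ordering between kernels.
    optixLaunch( pipeline1, stream, d_params, paramsSize, sbt, width, height, 1 );
    optixLaunch( pipeline2, stream, d_params, paramsSize, sbt, width, height, 1 );

    // Host-side barrier: block the CPU until everything queued on the stream is done.
    cudaStreamSynchronize( stream );
}
```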
Hey David, thanks for clarifying the use of barriers and for the suggestion. However, in my case the states of the threads differ (imagine that in path tracing some paths have already terminated while others are still bouncing), so the cost of recording and managing those states in a buffer makes this not a good choice for my situation. That is, I would like a kernel to block at a certain point rather than be divided into several small ones. While I have now sort of managed to achieve what I want, I feel it would be more efficient if thread synchronization could be achieved.
One option would be a wavefront architecture, which is exactly as you describe: storing the path state in a buffer. While it’s not ideal for the reasons you mention, it is not uncommon for people to use a wavefront architecture for GPU rendering, since it can sometimes be easier to scale than the alternative megakernel approach you currently have. One advantage of a wavefront architecture is that when some paths have already terminated and some are still bouncing, you have the option to reduce your kernel size with every launch, instead of letting some threads become inactive. With your current setup, all the paths that terminate become inactive threads, which wastes time and reduces thread coherence as fewer and fewer paths are left.
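As a rough sketch, the per-path state a wavefront renderer might keep in global memory between launches could look something like this (fields and names are purely illustrative):

```cpp
// Illustrative per-path state and queues for a wavefront-style renderer.
#include <cuda_runtime.h>
#include <vector_types.h>

struct PathState
{
    float3       origin;      // current ray origin
    float3       direction;   // current ray direction
    float3       throughput;  // accumulated path throughput
    float3       radiance;    // accumulated radiance so far
    unsigned int pixelIndex;  // which pixel this path belongs to
    unsigned int rngState;    // per-path random number generator state
    int          depth;       // current bounce count
};

struct WavefrontQueues
{
    PathState*    paths;        // one entry per live path, compacted between launches
    unsigned int* numLivePaths; // shrinks as paths terminate
};
```

Each subsequent launch then only needs to cover the live entries, so terminated paths stop occupying threads.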
Another option might be to separate into sub-kernels that all share as much state as possible, so you can track most of your state at a global level, and minimize the per-thread state that you would need to store in a buffer.
There is a CUDA 9+ feature called Cooperative Thread Groups that sort of has some of the synchronization properties you want. It can synchronize all the threads in a kernel, but only if you launch a number of thread blocks that does not exceed the number of SMs. Programming Guide :: CUDA Toolkit Documentation I haven’t used this, but I suspect the block number constraint means the total number of threads cannot exceed the total number of CUDA cores on your GPU, meaning this probably wouldn’t work for you even if you could use it. But, either way, it is not available from an OptiX launch, so this is not currently an option.
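For reference, a plain CUDA sketch of what that grid-wide sync looks like (not usable from an OptiX launch; kernel and variable names are made up, and the code must be built with relocatable device code and launched cooperatively):

```cpp
// Grid-wide synchronization with cooperative groups (plain CUDA, not OptiX).
// Requires compute capability 6.0+ and compilation with -rdc=true.
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void cooperativeKernel( unsigned int* counter )
{
    cg::grid_group grid = cg::this_grid();

    atomicAdd( counter, 1u );

    // Every thread of the whole grid waits here; this only works when all
    // blocks of the launch can be resident on the device at the same time.
    grid.sync();

    // Past this point, every thread has already incremented the counter.
}

void launchCooperative( unsigned int* d_counter, dim3 gridDim, dim3 blockDim )
{
    void* args[] = { &d_counter };
    // Must use the cooperative launch API instead of the <<<...>>> syntax.
    cudaLaunchCooperativeKernel( (void*)cooperativeKernel, gridDim, blockDim, args, 0, 0 );
}
```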
So currently the only option for synchronizing all threads in a ray tracing launch is to use separate kernels. I recommend trying it and measuring the performance before assuming that the cost of the state buffer would slow it down. There is some cost to saving the ray state, of course, but maybe it’s less than you fear, or maybe achieving the synchronization you’re after will provide more benefit than the cost of saving state.
Yes, I can imagine that it’s not practical for all threads of a GPU to synchronize during kernel execution; maybe even the number of registers is not enough for that. We are working on a production-level renderer which needs to store a lot of intermediate data, for example AOVs and the corresponding throughput values (which also benefits from CUDA’s flexible register allocation, I guess). Consequently, a wavefront architecture would occupy a really huge amount of memory, possibly several gigabytes, and it scales with resolution, while the memory budget is already very tight for production rendering. I will take this as a limitation and use some workarounds to solve it, and take a look at that CUDA 9 feature. Thank you again for so much information!