I read the OptiX Programming Guide, and it says barriers are not allowed in the input PTX code. However, we need them quite badly. Is there any workaround to get this behavior?
In our work, we want all threads to reach the same point and then wait, which is exactly what a thread barrier does in CUDA. As a test, I tried to implement something like the following:
__device__ unsigned int global_count = 0;
__device__ unsigned int global_waiting_count = 0;

__global__ void kernel()
{
    atomicAdd(&global_count, 1u);          // thread has started
    some_logic_code();
    atomicAdd(&global_waiting_count, 1u);  // thread reached the wait point
    long_waiting_loop();                   // spin here; this is what I want to replace with a barrier
    atomicSub(&global_waiting_count, 1u);
    other_logic();
    atomicSub(&global_count, 1u);          // thread has finished
}
In this example, each thread adds 1 to the global thread count when it starts, so I can see how many threads have started executing, and subtracts 1 again when it reaches the end. I also keep a second counter that is incremented when a thread reaches the long waiting function (the "waiting count" in the pseudocode above). However, the numbers are inconsistent: even if I make the waiting loop infinite, the values of global_count and global_waiting_count never match. I understand that OptiX essentially schedules the threads asynchronously, and once too many threads are stuck in the infinite loop, the compute resources (warps, presumably) are fully occupied, so the remaining threads never get to run. If I could replace the loop with a real barrier, this kernel should behave as expected.
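To illustrate what I mean, here is a minimal sketch in plain CUDA (not OptiX) of the behavior I am after. It uses cooperative groups grid synchronization; kernel_with_barrier and the empty placeholder bodies are just illustrative, and as far as I understand the required cooperative launch is not available for OptiX programs:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder stages standing in for the real work in my kernel.
__device__ void some_logic_code() { /* ... */ }
__device__ void other_logic()     { /* ... */ }

__global__ void kernel_with_barrier()
{
    some_logic_code();
    // Every thread of the entire launch waits here before continuing.
    // In plain CUDA this requires a cooperative launch
    // (cudaLaunchCooperativeKernel); OptiX launches offer no equivalent as far as I know.
    cg::this_grid().sync();
    other_logic();
}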
Is this possible in any way?