I’m working on CUDA support for the Julia programming language (JuliaGPU/CUDA.jl on GitHub), where we offer a CUDA C-like programming environment, but in Julia (a higher-level language). We expose kernel programming APIs at a similar abstraction level as CUDA C, e.g., you can write kernel functions and call sync_threads(). That gets lowered to the LLVM llvm.nvvm.barrier0 intrinsic, which is compiled to the bar.sync PTX instruction.
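For context, a minimal sketch of such a kernel (the array contents and launch configuration here are made up for illustration):

```julia
using CUDA

function kernel(a)
    i = threadIdx().x
    a[i] += 1
    sync_threads()  # lowered to llvm.nvvm.barrier0, i.e. bar.sync in PTX
    return
end

@cuda threads=32 kernel(CUDA.zeros(32))
```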
Now, the Julia compiler can do what we call union-splitting: if you invoke a function with values that may be differently-typed, the compiler will insert a branch that checks the type of the value at run time and dispatches to the appropriate specialization of that function. This poses a problem if that function performs synchronization, because we may now effectively execute bar.sync from divergent contexts, which can deadlock.
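Schematically, the problematic pattern looks like this (the function names are hypothetical; the `if`/`else` shows what the compiler effectively generates, not code a user writes):

```julia
# `x` may be an Int32 or a Float32, and the compiler cannot prove which,
# so the call to `process(x)` gets union-split into a runtime type check:
function process(x)
    # ... type-specific work ...
    sync_threads()
    # ... more work ...
end

# What the compiler effectively emits for `process(x)`:
#
#   if x isa Int32
#       process_int32(x)    # contains one bar.sync site
#   else
#       process_float32(x)  # contains a *different* bar.sync site
#   end
#
# Threads holding differently-typed values take different branches and
# therefore reach different bar.sync instructions: divergent synchronization.
```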
Disregarding performance here (well-optimized kernels won’t behave like this), I’d want to at least make it possible to execute some basic instances of union-split code. (It isn’t possible in general, since the individual implementations branched to might be completely different, e.g., using different mechanisms for thread communication depending on the data type.) So I was considering switching our sync_threads implementation to llvm.nvvm.barrier.sync, which emits an unaligned barrier.sync that allows synchronization from different points in the code, such as differently-typed versions of the same function.
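A rough sketch of what I have in mind, following the ccall-to-intrinsic pattern we already use for llvm.nvvm.barrier0 (the function name and the default barrier id of 0 are my assumptions, not existing API):

```julia
# Hypothetical replacement for sync_threads, using the id-taking intrinsic.
# llvm.nvvm.barrier.sync takes an i32 barrier id and should emit the
# unaligned barrier.sync PTX instruction rather than bar.sync.
@inline function sync_threads_unaligned(id::UInt32 = UInt32(0))
    ccall("llvm.nvvm.barrier.sync", llvmcall, Cvoid, (UInt32,), id)
    return
end
```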
How (un)safe a change is this? Has this been done before? I could use a different barrier id for each separate synchronization site to prevent unrelated barriers from matching up, but it seems to me that there would be bigger problems if that were possible.