So far I have used ‘warp-synchronous programming’ in OpenCL to get optimal performance and to make up for the fact that nVIDIA GPUs can keep more warps in flight than Vulkan/OpenCL workgroups (= CUDA thread blocks). My workgroups therefore typically consist of 2-3 warps (64-96 threads). Because the warps can diverge, the usual workgroup barriers don’t work, so I use warp-synchronous programming with volatile variables instead.
Starting with Volta, this no longer works, because the threads of a warp are no longer guaranteed to execute in lockstep. CUDA has the new __syncwarp() function to deal with that, and while porting my OpenCL code I’m now searching for the Vulkan way to achieve the same thing. Does anybody know?
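For reference, here is a minimal CUDA sketch of the kind of pattern I mean (not my actual code): a per-warp sum in shared memory, where __syncwarp() takes over the role that implicit lockstep plus volatile used to play. The kernel name warpSum, the 96-thread block size and the buffer layout are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: every warp sums its own 32 input values.
// Assumes blockDim.x == 96, i.e. 3 warps per block.
__global__ void warpSum(const float* in, float* out)
{
    __shared__ float buf[96];
    const unsigned tid  = threadIdx.x;
    const unsigned gid  = blockIdx.x * blockDim.x + tid;
    const unsigned lane = tid & 31u;      // index of the thread inside its warp

    buf[tid] = in[gid];
    __syncwarp();                         // stores visible to the whole warp

    // Tree reduction inside one warp. Pre-Volta code typically dropped the
    // __syncwarp() calls and declared buf as volatile instead.
    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        float v = buf[tid];
        if (lane < offset)
            v += buf[tid + offset];       // partner is always in the same warp
        __syncwarp();                     // all reads finished before any write
        buf[tid] = v;
        __syncwarp();                     // writes visible before the next reads
    }

    if (lane == 0)
        out[gid / 32] = buf[tid];         // one result per warp
}

int main()
{
    const int threads = 96, warps = threads / 32;
    float hIn[threads], hOut[warps];
    for (int i = 0; i < threads; ++i) hIn[i] = 1.0f;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn,  sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    warpSum<<<1, threads>>>(dIn, dOut);   // one block of 3 warps
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    for (int w = 0; w < warps; ++w)
        printf("warp %d sum = %f\n", w, hOut[w]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The point is that each warp only waits for its own 32 threads, never for the other warps in the workgroup, and that is the behaviour I would like to reproduce in a Vulkan compute shader.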
The goal of Vulkan is to deliver maximum graphics performance, and the old warp-synchronous programming usually sped things up by ~20% in my benchmarks, so I don’t want to lose this performance… And I don’t think nVIDIA want to lose it either ;-))
Many thanks for your help,