On GPUs with the Volta architecture or later, it is unsafe to assume that the threads of a warp execute in lockstep (see e.g. https://devblogs.nvidia.com/using-cuda-warp-level-primitives/).
We have a CUDA library that relies on implicit warp synchronization in several places (e.g. in intra-warp reductions). Is it possible to run these library routines correctly on Volta GPUs without changing the library's source code? Inspecting the source and changing every affected section would be a considerable task, as the library routines are quite complex.
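To make "implicit warp synchronization" concrete, here is a minimal sketch of the kind of pattern I mean, next to an explicitly synchronized variant. It is only an illustration: the kernel names `warpReduceImplicit` and `warpReduceExplicit` are hypothetical and not taken from our library, the launch is assumed to be a single warp of 32 threads, and the explicit variant uses `__syncwarp()`, which is only available from CUDA 9 onwards.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example, not library code: sums 32 ints within one warp.
// There is no barrier between the steps, so each step assumes that all
// lanes have already finished the previous one (lockstep execution).
__global__ void warpReduceImplicit(const int *in, int *out)
{
    __shared__ volatile int s[32];
    const unsigned lane = threadIdx.x;
    s[lane] = in[lane];
    if (lane < 16) s[lane] += s[lane + 16];
    if (lane <  8) s[lane] += s[lane +  8];
    if (lane <  4) s[lane] += s[lane +  4];
    if (lane <  2) s[lane] += s[lane +  2];
    if (lane <  1) s[lane] += s[lane +  1];
    if (lane == 0) *out = s[0];
}

// Same reduction with explicit warp-level barriers (requires CUDA 9+).
// Reads and writes of shared memory are separated by __syncwarp(), and
// every lane of the warp reaches each barrier.
__global__ void warpReduceExplicit(const int *in, int *out)
{
    __shared__ int s[32];
    const unsigned lane = threadIdx.x;
    int v = in[lane];
    s[lane] = v;
    __syncwarp();
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset) v += s[lane + offset];
        __syncwarp();   // all reads of this step are done before any write
        if (lane < offset) s[lane] = v;
        __syncwarp();   // all writes are visible before the next step reads
    }
    if (lane == 0) *out = v;
}

int main()
{
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, 2 * sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    // One block of exactly one warp for each variant.
    warpReduceImplicit<<<1, 32>>>(d_in, d_out);
    warpReduceExplicit<<<1, 32>>>(d_in, d_out + 1);

    int h_out[2] = {0, 0};
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("implicit: %d, explicit: %d\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The first kernel is the style that appears throughout the library; rewriting every such spot in the style of the second kernel is the effort I would like to avoid.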
We use CUDA Toolkit 8.x or 9.x. One possible solution I have thought of is to set the NVCC compiler flags so that neither native machine code nor PTX is generated for Volta (or later) GPUs. Instead, I would embed PTX for Pascal (the last architecture for which implicit warp synchronization is guaranteed) in the library via '-gencode arch=compute_61,code=compute_61', so that this PTX is JIT-compiled when the library runs on a newer GPU. Is that a viable and correct solution (or rather, workaround)?