Run CUDA code assuming implicit warp synchronization on Volta _without_ modifying code

On GPUs with the Volta architecture or later, it is unsafe to assume that the threads of a warp execute in lockstep (see e.g. Using CUDA Warp-Level Primitives | NVIDIA Technical Blog).

We have a CUDA library which relies on implicit warp synchronization in several places (e.g. in intra-warp reductions). Is it possible to run these library routines correctly on Volta GPUs without changing the source code of the library? Inspecting the source and changing all affected sections would be quite a task, as the library routines are fairly complex.
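For readers unfamiliar with the pattern, here is a minimal sketch of the kind of code in question. It is a hypothetical example and not taken from our library; the kernel name `legacyWarpSum` and the fixed launch size of 32 threads are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical legacy kernel: sums 32 values within a single warp and relies
// on implicit lockstep execution (there is no __syncwarp() between the steps).
__global__ void legacyWarpSum(const int *in, int *out)
{
    __shared__ volatile int s[32];
    const int lane = threadIdx.x;   // assumes a launch with exactly 32 threads

    s[lane] = in[lane];

    // Every step reads a value that another lane wrote in the previous step.
    // Before Volta this works because the warp runs in lockstep; with Volta's
    // independent thread scheduling that ordering is no longer guaranteed.
    if (lane < 16) s[lane] += s[lane + 16];
    if (lane <  8) s[lane] += s[lane +  8];
    if (lane <  4) s[lane] += s[lane +  4];
    if (lane <  2) s[lane] += s[lane +  2];
    if (lane <  1) s[lane] += s[lane +  1];

    if (lane == 0) *out = s[0];
}

int main()
{
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = 1;   // input: 32 ones

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    legacyWarpSum<<<1, 32>>>(d_in, d_out);

    int h_out = 0;
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("warp sum = %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Our real routines follow the same idea but are considerably more involved, which is why we would like to avoid touching each of them.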

We use CUDA toolkit 8.X or 9.X. One possible solution I have thought of is to set the NVCC compiler flags so that no native machine code or PTX code is generated for Volta GPUs (or later). Instead, I want to add PTX code for Pascal (the last architecture for which implicit warp synchronization is guaranteed) to the library via '-gencode arch=compute_61,code=compute_61'. Is that a viable and correct solution (or rather, workaround)?

Targeting compute 6.x (Pascal) should do the trick (best if PTX code is included) if you also make sure that no compute 7.x PTX or SASS is available in the binary.
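One way to check what actually gets loaded is to query a kernel's attributes at runtime. The following is a minimal sketch, assuming a trivial kernel called `probeKernel` (a made-up name) that is compiled with the same flags as the library, for example `nvcc -gencode arch=compute_61,code=compute_61 probe.cu -o probe`. It uses `cudaFuncGetAttributes`, whose `ptxVersion` and `binaryVersion` fields report the PTX and binary architecture versions as 10 * major + minor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel; it only exists so that its attributes can be queried.
__global__ void probeKernel() {}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, probeKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("device compute capability: %d.%d\n", prop.major, prop.minor);
    printf("kernel ptxVersion:         %d\n", attr.ptxVersion);
    printf("kernel binaryVersion:      %d\n", attr.binaryVersion);

    // A ptxVersion of 70 or higher means the kernel was compiled from
    // Volta (or later) PTX, which is what the workaround tries to avoid.
    if (attr.ptxVersion >= 70) {
        printf("warning: kernel was built for compute_70 or later\n");
    }
    return 0;
}
```

Note that this only reports on the kernel in the probe's own compilation unit. To check the library itself you would have to query one of its kernels in the same way, or inspect the binary with cuobjdump.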

You could also look into the PTXAS documentation. It may have a switch to enable backwards compatible warp synchronization even when targeting Volta. But I can’t find an online PTXAS manual right now to verify.

EDIT: might the PTXAS option --legacy-bar-warp-wide-behavior be the right one? But then I am puzzled that the ptxas --help output states that it is ignored for sm_70 targets and above. I would have expected the opposite, namely that it applies only to sm_70 targets and above. So either the --help output is wrong or I misunderstand what this switch is supposed to do.

(The nvcc option -Xptxas makes it possible to forward such options from nvcc to ptxas.)

For CUDA toolkit 9.x, a method to "opt in" to "Pascal thread scheduling" on a Volta machine is given here:

https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

From the last paragraph:

“One last trick. If your existing CUDA program gives a different result on Volta architecture GPUs, and you suspect the difference is caused by Volta’s new independent thread scheduling which can change warp synchronous behavior, you may want to recompile your program with nvcc options -arch=compute_60 -code=sm_70. Such compiled programs opt-in to Pascal’s thread scheduling. When used selectively, it can help pin down the culprit module more quickly, allowing you to update the code to avoid implicit warp-synchronous programming.”

I’m not suggesting this provides guarantees about the correctness of any particular code. It is offered there as a tool to aid in the analysis of warp-synchronous coding patterns, ostensibly so that they can be removed to make the code compliant with cc7.0 and future architectures.

This methodology would allow cc7.0 SASS code to be present in a binary compiled this way, which might make some forms of detailed analysis a bit easier.