Run CUDA code assuming implicit warp synchronization on Volta _without_ modifying code

HannesF99 · July 4, 2018, 8:16am

On GPUs with the Volta architecture or later, it is unsafe to assume that the threads of a warp execute in lockstep (see e.g. Using CUDA Warp-Level Primitives | NVIDIA Technical Blog).

We have a CUDA library which relies on implicit warp synchronization behaviour in several places (e.g. within intra-warp reductions). Is it possible to run these library routines correctly on Volta GPUs without changing the source code of the library? Inspecting the source code and changing all affected code sections is quite a task, as the library routines are quite complex.
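For readers unfamiliar with the pattern, here is a minimal sketch of what such an intra-warp reduction typically looks like, next to a form that stays correct under independent thread scheduling. All function names are hypothetical and not taken from the library in question, and `__shfl_down_sync` assumes CUDA toolkit 9.0 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Legacy warp-synchronous pattern, shown only for contrast (never called here).
// No barrier separates the shared-memory write of one step from the read of
// the next; it relies on all 32 threads of the warp running in lockstep.
__device__ int warp_sum_implicit(volatile int *smem, int lane)
{
    if (lane < 16) smem[lane] += smem[lane + 16];
    if (lane <  8) smem[lane] += smem[lane +  8];
    if (lane <  4) smem[lane] += smem[lane +  4];
    if (lane <  2) smem[lane] += smem[lane +  2];
    if (lane <  1) smem[lane] += smem[lane +  1];
    return smem[0];
}

// Explicitly synchronized variant: the full mask names all 32 lanes as
// participants, so the exchange does not depend on implicit lockstep.
__device__ int warp_sum_sync(int value)
{
    for (int offset = 16; offset > 0; offset /= 2)
        value += __shfl_down_sync(0xffffffffu, value, offset);
    return value;  // only lane 0 holds the complete sum
}

__global__ void warp_sum_kernel(int *out)
{
    int lane = threadIdx.x & 31;
    int total = warp_sum_sync(lane);  // each lane contributes its lane index
    if (lane == 0) *out = total;
}

// Host helper: launches a single warp and returns the sum, or -1 on any error.
int warp_sum_demo()
{
    int *d_out = NULL;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;
    warp_sum_kernel<<<1, 32>>>(d_out);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(d_out); return -1; }
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}
```

Calling `warp_sum_demo()` from `main` on a working device should return 496, the sum of the lane indices 0 to 31. The first function is the kind of code that would need to be found and rewritten if the library were to be ported properly.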

We use CUDA toolkit 8.X or 9.X. One possible solution I have thought of is to set the NVCC compiler flags so that no native machine code or PTX code is generated for Volta GPUs (or later). Instead, I want to add PTX code for Pascal (the last architecture for which implicit warp synchronization is guaranteed) to the library via ‘-gencode arch=compute_61,code=compute_61’. Is that a viable and correct solution (or better: workaround)?

cbuchner1 · July 4, 2018, 8:49am

Targeting compute 6.x (Pascal) should do the trick (best if PTX code is included) if you also make sure that no compute 7.x PTX or SASS is available in the binary.

You could also look into the PTXAS documentation. It may have a switch to enable backwards compatible warp synchronization even when targeting Volta. But I can’t find an online PTXAS manual right now to verify.

EDIT: might the PTXAS option --legacy-bar-warp-wide-behavior be the right one? But then I am puzzled that the ptxas --help output states that it is ignored for sm_70 targets and above. I would have expected it to apply only to sm_70 targets and above. So either it’s a bug in the --help output or I misunderstand what this switch is supposed to do.

(With the -Xptxas option it is possible to forward such options from nvcc to ptxas.)

Robert_Crovella · July 4, 2018, 3:07pm

For CUDA toolkit 9.x, a method to “opt in” to “Pascal thread scheduling” on a Volta machine is given here:

https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

from the last paragraph:

“One last trick. If your existing CUDA program gives a different result on Volta architecture GPUs, and you suspect the difference is caused by Volta’s new independent thread scheduling which can change warp synchronous behavior, you may want to recompile your program with nvcc options -arch=compute_60 -code=sm_70. Such compiled programs opt-in to Pascal’s thread scheduling. When used selectively, it can help pin down the culprit module more quickly, allowing you to update the code to avoid implicit warp-synchronous programming.”

I’m not suggesting this provides guarantees about the correctness of any particular code. It is offered up there as a tool to aid in analysis of warp-synchronous coding patterns, ostensibly for the purpose of removing those to be compliant with cc7.0 and future architectures.

This methodology would allow cc7.0 SASS code to be present in a binary compiled this way, which might make some forms of detailed analysis a bit easier.
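As a rough way to confirm which virtual architecture a kernel was actually built for, the sketch below reports `__CUDA_ARCH__` from device code alongside the compute capability of the device. The file name and helper names are placeholders. The two build lines in the comment are the flag combinations quoted in this thread. The check only shows which compute_XX target was compiled; it says nothing about whether a given warp-synchronous routine is correct.

```cuda
// Build variants mentioned in this thread (arch_check.cu is a placeholder name):
//   nvcc -gencode arch=compute_61,code=compute_61 arch_check.cu -o arch_check
//   nvcc -arch=compute_60 -code=sm_70 arch_check.cu -o arch_check
#include <cstdio>
#include <cuda_runtime.h>

__global__ void report_arch(int *out)
{
#if defined(__CUDA_ARCH__)
    // Fixed at compile time by the virtual architecture (compute_XX), so it
    // stays at 6xx even if the PTX is later JIT-compiled for a Volta device.
    *out = __CUDA_ARCH__;
#endif
}

// Returns the __CUDA_ARCH__ value seen by the kernel, or -1 on any error.
int compiled_arch()
{
    int *d_out = NULL;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;
    report_arch<<<1, 1>>>(d_out);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(d_out); return -1; }
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}

// Returns the compute capability of device 0 in the same encoding
// (for example 700 for compute capability 7.0), or -1 on error.
int device_cc()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return prop.major * 100 + prop.minor * 10;
}
```

If `compiled_arch()` reports a 6xx value while `device_cc()` reports 700, the kernel running on the Volta device was generated from a Pascal virtual architecture, which is the situation both suggestions above aim for. Listing the embedded images with `cuobjdump --list-elf --list-ptx` on the binary is another way to check that no compute 7.x PTX slipped in.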
