Volatile keyword

Hi guys,
I have a question about the use of the volatile keyword.
I’m trying to implement a kernel with all floating-point instructions duplicated, either FP64 or FP32.
The NVCC compiler, however, is aggressive enough that it optimizes away some of the code, and I don’t want that.
I’m thinking of using the volatile keyword to prevent this optimization, and I would like your opinion on whether this is the right approach or whether you would suggest something different.

I tried using CUDA intrinsic instructions to duplicate the benchmarks and force execution of both copies.
However, I saw strange behavior in the nvprof results when counting the instructions. I used two nvprof metrics, ipc and flop_count_dp_fma, and also measured the execution time of the kernel (using the C++ time library).
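For reference, the nvprof invocation was along these lines (the binary name here is just a placeholder):

    nvprof --metrics ipc,flop_count_dp_fma ./my_benchmark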

First, I tested a simple duplication of each DFMA in the kernel. However, nvprof showed that the number of instructions executed is the same for the duplicated kernel and for the kernel in which no operation is duplicated, which means the compiler removed all copied instructions.

So, I added the “volatile” C++ keyword to keep the compiler from removing the duplicated instructions. With that, the IPC for the non-duplicated kernel and the duplicated kernel is almost the same and, as expected, the execution time of the duplicated kernel doubled. The number of FMA instructions is also doubled in the duplicated kernel.
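This is roughly the pattern I am using (a simplified sketch, not my actual benchmark; the kernel and variable names are made up):

    __global__ void dup_kernel(const double *a, const double *b,
                               const double *c, double *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        double r1 = fma(a[i], b[i], c[i]);           // original DFMA
        volatile double r2 = fma(a[i], b[i], c[i]);  // duplicated DFMA; without
                                                     // volatile, NVCC removes it
                                                     // as dead code since r2 is
                                                     // never used
        out[i] = r1;
    }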

The main question is: how dangerous is it to use the “volatile” keyword for this purpose?
I know this is not the purpose for which the volatile keyword exists, but in this situation I have not found another way to prevent the NVCC optimizations that remove the duplicated instructions.

Thanks, Fernando

It used to be that the CUDA compiler performed no re-association of floating-point expressions other than aggressively pursuing FMA-contraction (FMUL & FADD -> FMA). So far I have not seen evidence that this behavior has changed. In other words, the CUDA compiler is very conservative when it comes to floating-point expressions compared to most host compilers at default settings. FMA-contraction can be turned off with the compiler command line flag -fmad=false. With that in place you should see the source sequence of operations reflected in the machine code.
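For example, in a function like the following, default compilation contracts the multiply and the dependent add into a single DFMA, while -fmad=false keeps them as separate instructions (the function itself is just an illustration):

    __device__ double axpy_term(double a, double x, double y)
    {
        return a * x + y;  // one DFMA with the default -fmad=true;
                           // separate DMUL + DADD with -fmad=false
    }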

If you have a simple example where that is not the case, I’d like to see it. Note that optimizations such as constant propagation can still take place and eliminate some operations. However, that should only happen where the results are bit-wise identical before and after the optimization. There have been a few bugs in the past where constants were propagated through floating-point operations incorrectly, i.e. not in a bit-wise identical fashion.

The compiler will also remove dead code relentlessly, so if that is what you are experiencing (“duplicated instructions” seems to suggest it), you need to make sure that all operations eventually feed into a globally visible result via a dependency chain. For benchmark usage, I usually use an addition chain, as sketched below. The overhead can be removed from the benchmark measurements by calibration.
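A sketch of that pattern (all names are illustrative):

    __global__ void bench(const double *a, const double *b, double *result, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        double acc = 0.0;
        for (int i = 0; i < n; i++) {
            acc += a[i] * b[i];  // every operation feeds the addition chain
        }
        result[tid] = acc;       // the chain terminates in a globally visible
                                 // store, so none of it is dead code
    }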

To suppress FMA-merging locally, you could code with intrinsics, e.g. __fadd_rn(), __fmul_rn(), __dadd_rn(), __dmul_rn(). You can also experiment with reducing the optimization level of the compiler backend PTXAS, with the -Xptxas -O{0|1|2|3} flag, where -Xptxas -O3 is the compiler default.
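For example (a minimal sketch), the following compiles to separate DMUL and DADD instructions, since intrinsics with an explicit rounding mode are never merged into an FMA:

    __device__ double no_fma_term(double a, double x, double y)
    {
        return __dadd_rn(__dmul_rn(a, x), y);  // never contracted into DFMA
    }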

Using the volatile modifier in the manner you envision is a big hammer that has all kinds of negative performance implications beyond the evaluation of floating-point expressions. I would not recommend it, but there is no particular danger in it other than the code running slowly.