Same JIT-compiled program produces different results on Kepler and Maxwell


We are currently trying to upgrade our GPU from a K20 (Kepler architecture) to an M5000 (Maxwell architecture) for an image processing application. However, with the same input, the output is different. The CUDA program involves floating-point computation. It is compiled into PTX and loaded at run time.

From searching the forum, it looks like different architectures may produce different results when floating-point calculation is involved.

Since I’m new to CUDA, could an expert confirm or explain how this can happen? How large should I expect these differences to be?

Many thanks in advance!

Floating point operations don’t satisfy certain order-of-operations identities that we expect for integer operations. For example, with floating point,

(a+b)+c is not necessarily equal to a+(b+c)

Since GPUs sometimes do operations like this “in parallel”, the order of operations may vary from run to run, even with the same binary. It can certainly vary if the compiler (version) is changed, different optimization levels are specified, or a different GPU architecture is targeted (which will happen when the JIT process is used on different GPUs).
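To make the non-associativity concrete, here is a minimal host-side C++ sketch (not CUDA, so it runs without a GPU). The function names sum_left and sum_right are made up for this illustration. It assumes IEEE 754 single precision and a build without fast-math style flags.

```cpp
// Hypothetical helpers showing that float addition is not associative.
// The volatile temporaries force each intermediate sum to be rounded to float,
// so the compiler cannot keep extra precision or reorder the additions.

// Computes (a + b) + c.
float sum_left(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

// Computes a + (b + c).
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, sum_left returns 1.0f while sum_right returns 0.0f: near 1e8 adjacent floats are 8 apart, so b + c rounds back to -1e8f and the 1.0f is lost. A parallel reduction that groups the same operands differently can therefore give a different sum.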

This floating point order of operations phenomenon is discussed in many places, such as here:

Normally, the differences are small, but people looking for bitwise identical results may not find them.
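If bitwise identity is not required, one common way to compare the outputs from two GPUs is with a tolerance. The following is only a sketch in host-side C++: nearly_equal and count_mismatches are hypothetical helpers, and the default tolerances are placeholder assumptions that would need tuning for a real image pipeline.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when a and b agree within a relative or an
// absolute tolerance. The default tolerances are illustrative assumptions.
bool nearly_equal(float a, float b, float rel_tol = 1e-5f, float abs_tol = 1e-6f) {
    if (a == b) return true;  // exact match, including equal infinities
    if (!std::isfinite(a) || !std::isfinite(b)) return false;  // NaN or unequal infinities
    float diff = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Hypothetical helper: counts elements of two output buffers that differ by
// more than the tolerance. Elements present in only one buffer count as mismatches.
std::size_t count_mismatches(const std::vector<float>& ref, const std::vector<float>& out) {
    std::size_t n = std::min(ref.size(), out.size());
    std::size_t bad = std::max(ref.size(), out.size()) - n;
    for (std::size_t i = 0; i < n; ++i) {
        if (!nearly_equal(ref[i], out[i])) ++bad;
    }
    return bad;
}
```

A nonzero count with a reasonable tolerance points to something other than ordinary rounding differences, such as a race condition.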

Please confirm that these are the results from a controlled experiment in which only a single variable was changed: Other than changing the GPU, the system is completely unchanged, both with regard to hardware and to software.

If so, I can think of three possibilities off the top of my head: (1) Race conditions in your code. cuda-memcheck can help you find some of those. (2) Use of atomic operations with floating-point operands. Floating-point arithmetic is not associative, and the order of operations is indeterminate when using atomics. (3) Architecture-specific compiler differences in the compilation from PTX to machine code (SASS). For example, the Kepler and Maxwell backend optimizers may apply different FMA contractions.

txbob and njuffa,

Thanks for the quick response! I will digest the article as txbob suggested.

njuffa, the experiment is in a controlled environment: nothing in the software or hardware changed except the GPU. Regarding cuda-memcheck, what is it? Is it an nvcc compiler option? Is there any document that describes the differences between the Kepler and Maxwell backend optimizers when compiling PTX to SASS?

Did you try googling “cuda-memcheck”? It is a standalone command-line tool that ships with the CUDA toolkit, not an nvcc option.

Detailed internals of the CUDA toolchain are not publicly documented, but the fact that Kepler and Maxwell architectures are not binary compatible requires there to be architecture-specific code generators in the PTX to SASS conversion.

Various architecture-specific code generation bugs in PTX to SASS compilation discussed in these forums over the years strongly suggest that there are architecture-specific optimizing code transformations in PTXAS. That this may extend to different ways to contract FADD and FMUL into FMA is conjecture on my part. You can turn off FMA contraction completely by specifying -fmad=false when you generate the PTX code with nvcc, and check whether that eliminates the output differences between Kepler and Maxwell. Note that disabling FMA contraction could have a significant negative impact on code performance, and it will likely cause output differences with variants compiled with FMA contraction enabled.
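To show why FMA contraction alone can change results, here is a small host-side C++ sketch (not CUDA). mul_then_add and fused_mul_add are names made up for this example. The first rounds the product to float before the add, as a separate FMUL followed by FADD would. The second uses std::fma, which rounds only once, like a fused multiply-add. It assumes IEEE 754 single precision and no fast-math flags.

```cpp
#include <cmath>

// Hypothetical helper: a * b + c with two roundings (multiply, then add).
// The volatile temporary forces the product to be rounded to float and
// prevents the compiler from contracting the expression into an FMA.
float mul_then_add(float a, float b, float c) {
    volatile float p = a * b;
    return p + c;
}

// Hypothetical helper: a * b + c with a single rounding at the end.
float fused_mul_add(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

For example, with a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding that product to float drops the 2^-24 term, so mul_then_add returns 0.0f, while fused_mul_add keeps it and returns 2^-24 (about 5.96e-8). If two backends contract different subexpressions, the same PTX can produce results that differ in this way.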

Other than applying FMA contraction (and in contrast to many x86 CPU compilers), the CUDA toolchain handles floating-point expressions conservatively and does not typically re-associate such expressions, so the only other place where floating-point associativity issues creep in is the use of floating-point atomics.

Another potential source of numerical differences between different GPU architectures may be inside CUDA libraries, at least some of which have architecture-specific code paths that may use different orders of operations to accomplish the same computation.