Why Are My CUDA Compilation Results Different with Direct and Separate Compilation?

I have developed a very large kernel split across two files, A.cu and B.cu. A.cu contains the kernel implementation and calls a function my_func that is not implemented in A.cu. I have implemented my_func in B.cu; it uses a large number of registers and local memory (a rough sketch of the layout follows the commands below). I obtained different results using these two compilation methods:

  1. nvcc A.cu B.cu -arch=xxx
  2. nvcc -dlto -rdc=true -c A.cu -o A.o then nvcc A.o B.cu -rdc=true
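
For context, the layout looks roughly like the sketch below; everything apart from the file names and my_func is a made-up placeholder.

    // A.cu -- kernel calling a device function that is defined in another file
    __device__ double my_func(double x);          // declaration only; defined in B.cu

    __global__ void big_kernel(const double* in, double* out, int n)   // hypothetical name
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = my_func(in[i]);
    }

    // B.cu -- implementation that is heavy on registers and local memory
    __device__ double my_func(double x)
    {
        double acc[64];                           // stand-in for the local-memory usage
        for (int k = 0; k < 64; ++k)
            acc[k] = x * k + 1.0;
        double s = 0.0;
        for (int k = 0; k < 64; ++k)
            s += acc[k] * acc[k];
        return s;
    }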

Additionally, this issue is difficult to reproduce and only occurs when the kernel is very large. I have also examined the generated SASS code and found that some local memory that is used later in the program is being overwritten, even though this data is not stored in registers.

Can someone help explain why these two methods produce different results?

@Robert_Crovella Could you help me? Thank you very much!

What exactly do you mean by different results? Can you show representative examples of how the outputs differ with the two compilation modes? If these are results from a numerical computation, how big are the differences? Is this floating-point computation, or is it integer-based? Are atomic operations involved? Have you checked for out-of-bounds accesses and race conditions in your code with compute-sanitizer?
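
For reference, typical compute-sanitizer runs look something like this (the binary name is a placeholder):

    compute-sanitizer --tool memcheck  ./my_app    # out-of-bounds and misaligned accesses
    compute-sanitizer --tool racecheck ./my_app    # shared-memory data races
    compute-sanitizer --tool initcheck ./my_app    # reads of uninitialized device memory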

Assumption: This is floating-point computation, and smallish differences are observed in the outputs depending on compilation mode. No issues are reported by compute-sanitizer. No atomics are being used.

Hypothesis: Floating-point arithmetic is not associative. Therefore, the CUDA compiler does not re-associate floating-point computation, with the exception of merging an FMUL with a dependent FADD into an FMA (fused multiply-add), a transformation known as FMA contraction. Generally speaking, this contraction likely improves both the performance and the accuracy of the computation.

FMA contraction proceeds according to compiler heuristics, and one cannot know with certainty how it will be applied to an expression like a*b + c*d. Depending on the compilation mode (with or without link-time optimization), the expression tree prior to the FMA-contraction stage can look different, causing FMA contraction to be applied in different places. Due to the non-associativity of floating-point arithmetic, this causes slight local numerical differences, which may grow larger through subsequent computation before the final results are produced.
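
To make the effect concrete, here is a small, self-contained sketch (not your code; the input values are arbitrary) showing the three ways a*b + c*d can be generated: contracted one way, contracted the other way, or not contracted at all. For many inputs the three results differ in the last bits.

    // fma_contraction.cu -- illustrative sketch of the possible codegen choices
    #include <cstdio>

    __global__ void contractions(float a, float b, float c, float d, float* out)
    {
        out[0] = __fmaf_rn(a, b, c * d);                        // fma(a, b, round(c*d))
        out[1] = __fmaf_rn(c, d, a * b);                        // fma(c, d, round(a*b))
        out[2] = __fadd_rn(__fmul_rn(a, b), __fmul_rn(c, d));   // no contraction at all
    }

    int main()
    {
        float* out;
        cudaMallocManaged(&out, 3 * sizeof(float));
        contractions<<<1, 1>>>(1.1f, 2.3f, 3.7f, 5.3f, out);
        cudaDeviceSynchronize();
        printf("fma left: %.9g  fma right: %.9g  no fma: %.9g\n", out[0], out[1], out[2]);
        cudaFree(out);
        return 0;
    }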

To refute this hypothesis, repeat your experiment with -fmad=false to turn off FMA contraction. Are there still differences in the output between compilation modes? If so, form a new hypothesis.
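
Concretely, that would mean rerunning both of your build variants with the flag added, e.g. (everything except -fmad=false carried over from your post):

    nvcc -fmad=false A.cu B.cu -arch=xxx
    nvcc -fmad=false -dlto -rdc=true -c A.cu -o A.o
    nvcc -fmad=false A.o B.cu -rdc=true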

It is certainly possible for a compiler bug to be in play here, but generally speaking such bugs are rare, and more immediate explanations based on properties of the code itself are more likely to be the root cause of the observations.

I don’t know off the top of my head, but doesn’t the second invocation of nvcc here also need -dlto?
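
If so, the link step would look something like this (untested; just adding -dlto to your second variant):

    nvcc -dlto -rdc=true -c A.cu -o A.o
    nvcc -dlto A.o B.cu -rdc=true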

Or are you talking about differences in the generated instructions, or even in performance, while the output itself is the same?

Your first command does not enable relocatable device code (-rdc=true), so no device linking takes place. Therefore it could not possibly work if a __global__ function in A.cu calls a __device__ function defined in B.cu, which seems to be what you are describing (unless you are also doing something unexpected, like #include-ing B.cu in A.cu).
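
For a cross-file device call like that to link, relocatable device code has to be enabled on that command line as well, along the lines of:

    nvcc -rdc=true A.cu B.cu -arch=xxx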

Your question doesn’t make sense to me.