Why Are My CUDA Compilation Results Different with Direct and Separate Compilation?

I have developed a very large kernel split across two files, A.cu and B.cu. A.cu contains the kernel implementation and calls a function my_func that is not implemented in A.cu. I have implemented my_func in B.cu; it uses a large number of registers and local memory (a rough sketch of the layout follows the commands below). I obtained different results using these two compilation methods:

  1. nvcc A.cu B.cu -arch=xxx
  2. nvcc -dlto -rdc=true -c A.cu -o A.o then nvcc A.o B.cu -rdc=true
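
For context, the layout looks roughly like the sketch below; everything apart from the file names and my_func is a made-up placeholder.

    // A.cu -- kernel calling a device function that is defined in another file
    __device__ double my_func(double x);          // declaration only; defined in B.cu

    __global__ void big_kernel(const double* in, double* out, int n)   // hypothetical name
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = my_func(in[i]);
    }

    // B.cu -- implementation that is heavy on registers and local memory
    __device__ double my_func(double x)
    {
        double acc[64];                           // stand-in for the local-memory usage
        for (int k = 0; k < 64; ++k)
            acc[k] = x * k + 1.0;
        double s = 0.0;
        for (int k = 0; k < 64; ++k)
            s += acc[k] * acc[k];
        return s;
    }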

Additionally, this issue is difficult to reproduce and only occurs when the kernel is very large. I have also examined the generated SASS code and found that some local memory that is used later in the program is being overwritten, even though this data is not stored in registers.

Can someone help explain why these two methods produce different results?

@Robert_Crovella Could you help me? Thank you very much!

What exactly do you mean by different results? Can you show representative examples of how the outputs differ with the two compilation modes? If these are results from a numerical computation, how big are the differences? Is this floating-point computation, or is it integer-based? Are atomic operations involved? Have you checked for out-of-bounds accesses and race conditions in your code with compute-sanitizer?
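
For reference, typical compute-sanitizer runs look something like this (the binary name is a placeholder):

    compute-sanitizer --tool memcheck  ./my_app    # out-of-bounds and misaligned accesses
    compute-sanitizer --tool racecheck ./my_app    # shared-memory data races
    compute-sanitizer --tool initcheck ./my_app    # reads of uninitialized device memory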

Assumption: This is floating-point computation, and smallish differences are observed in the outputs depending on compilation mode. No issues are reported by compute-sanitizer. No atomics are being used.

Hypothesis: Floating-point arithmetic is not associative. Therefore, the CUDA compiler does not re-associate floating-point computation, with the exception of merging an FMUL with a dependent FADD into an FMA (fused multiply-add), a transformation known as FMA contraction. Generally speaking, this contraction likely improves both the performance and the accuracy of the computation.

FMA contraction proceeds according to compiler heuristics, and one cannot know with certainty how it will be applied to an expression like a*b + c*d. Depending on the compilation mode (with or without link-time optimization), the expression tree prior to the FMA-contraction stage can look different, causing FMA contraction to be applied in different places. Due to the non-associativity of floating-point arithmetic, this causes slight local numerical differences, which may grow larger through subsequent computation before the final results are produced.
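
To make the effect concrete, here is a small, self-contained sketch (not your code; the input values are arbitrary) showing the three ways a*b + c*d can be generated: contracted one way, contracted the other way, or not contracted at all. For many inputs the three results differ in the last bits.

    // fma_contraction.cu -- illustrative sketch of the possible codegen choices
    #include <cstdio>

    __global__ void contractions(float a, float b, float c, float d, float* out)
    {
        out[0] = __fmaf_rn(a, b, c * d);                        // fma(a, b, round(c*d))
        out[1] = __fmaf_rn(c, d, a * b);                        // fma(c, d, round(a*b))
        out[2] = __fadd_rn(__fmul_rn(a, b), __fmul_rn(c, d));   // no contraction at all
    }

    int main()
    {
        float* out;
        cudaMallocManaged(&out, 3 * sizeof(float));
        contractions<<<1, 1>>>(1.1f, 2.3f, 3.7f, 5.3f, out);
        cudaDeviceSynchronize();
        printf("fma left: %.9g  fma right: %.9g  no fma: %.9g\n", out[0], out[1], out[2]);
        cudaFree(out);
        return 0;
    }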

To refute this hypothesis, repeat your experiment with -fmad=false to turn off FMA contraction. Are there still differences in the output between compilation modes? If so, form a new hypothesis.
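
Concretely, that would mean rerunning both of your build variants with the flag added, e.g. (everything except -fmad=false carried over from your post):

    nvcc -fmad=false A.cu B.cu -arch=xxx
    nvcc -fmad=false -dlto -rdc=true -c A.cu -o A.o
    nvcc -fmad=false A.o B.cu -rdc=true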

It is certainly possible for a compiler bug to be in play here, but generally speaking such bugs are rare, and more immediate explanations based on properties of the code itself are more likely to be the root cause of the observations.

I don’t know off the top of my head, but doesn’t the second invocation of nvcc here also need -dlto?
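
If so, the link step would look something like this (untested; just adding -dlto to your second variant):

    nvcc -dlto -rdc=true -c A.cu -o A.o
    nvcc -dlto A.o B.cu -rdc=true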

Or are you talking about differences in the generated instructions, or even in performance, while the output itself is the same?

Your first command does not enable relocatable device code (-rdc=true), so no device linking takes place. Therefore it could not possibly work if a __global__ function in A.cu calls a __device__ function defined in B.cu, which seems to be what you are describing (unless you are also doing something unexpected, like #include-ing B.cu in A.cu).
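
For a cross-file device call like that to link, relocatable device code has to be enabled on that command line as well, along the lines of:

    nvcc -rdc=true A.cu B.cu -arch=xxx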

Your question doesn’t make sense to me.