Strange bug in nvcc v12.4 when dividing template kernels into several files

I’m decomposing a kernel into several files, but some strange behavior occurred.
Code:

// In fused_arith_242_part0.cu - Template implementation
template <typename Field>
__device__ void part0(ConstPolyPtr const* vars, PolyPtr const* mut_vars, unsigned long long idx) {
    Field tmp2 = make_slice_iter<Field>(vars[1])[idx];
    Field tmp1 = make_slice_iter<Field>(vars[0])[idx];
    Field tmp0 = tmp1 * tmp2;  // This multiplication is incorrectly optimized out
    make_slice_iter<Field>(mut_vars[0])[idx] = tmp0;
}

// In wrapper.cu - Ground truth implementation (works correctly)
template <typename Field>
__launch_bounds__(256) __global__ void fused_arith_242(ConstPolyPtr const* vars, PolyPtr const* mut_vars, unsigned long long len) {
    unsigned long long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= len) return;
    auto tmp2 = make_slice_iter<Field>(vars[1])[idx];
    auto tmp1 = make_slice_iter<Field>(vars[0])[idx];
    auto tmp0 = tmp1 * tmp2;
    make_slice_iter<Field>(mut_vars[0])[idx] = tmp0;
}
// modulized implementation which triggers the bug
template <typename Field>
__launch_bounds__(256) __global__ void fused_arith_242(ConstPolyPtr const* vars, PolyPtr const* mut_vars, unsigned long long len) {
    unsigned long long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= len) return;
    part0<Field>(
        vars, mut_vars, idx
    );
}

Issue Description: A complex optimization issue has been identified that depends on both template usage and code organization:

  1. Code Organization Dependencies:

    a. Working Cases:

    • Non-template implementation
    • Template implementation in a single file
    • Template implementation with int type

    b. Failing Case:

    • Template implementation split across files
    • Using bn254::Element as template parameter (large number computation)
  2. Type Dependency:

    • int type: Works in all cases
    • bn254::Element: Fails with split template implementation
  3. Optimization Behavior:

    • In split template implementation:

      Field tmp1 = make_slice_iter<Field>(vars[0])[idx];
      Field tmp0 = tmp1 * tmp2;  // Incorrectly optimized out
      // when inputing with [1, 2, 3, 4] and [2, 3, 4, 5] (all in big number format)
      // the output is [1, 3, 4, 5]
      // but the truth is [2, 6, 12, 20]
      
  4. Verification Methods: Working Solutions:

    // Solution 1: Non-template implementation
    auto tmp1 = make_slice_iter<bn254::Element>(vars[0])[idx];
    auto tmp0 = tmp1 * tmp2;
    
    // Solution 2: Add printf
    Field tmp1 = make_slice_iter<Field>(vars[0])[idx];
    printf("tmp1: %d\n", tmp1.to_number().limbs[0]);
    Field tmp0 = tmp1 * tmp2;
    

Impact:

  • Silent failure in cryptographic computations
  • Affects only complex field types in split template implementations
  • No compilation errors or runtime crashes

Workarounds:

  1. Keep template implementation in a single file
  2. Use non-template implementation
  3. Add printf statements to prevent optimization

I think this maybe a bug caused by nvcc? Or is it caused by other known issues?

Ah, I found the problem. It seems that when there are multiple template-based functions with the same name, the linker won’t raise an error, but, all these functions are instantiated with the same template parameter, and the linking causes the mistake. But I’m still confused why the linker won’t raise an error?