I’m decomposing a kernel into several files, but some strange behavior occurred.
Code:
// In fused_arith_242_part0.cu - Template implementation
template <typename Field>
__device__ void part0(ConstPolyPtr const* vars, PolyPtr const* mut_vars, unsigned long long idx) {
Field tmp2 = make_slice_iter<Field>(vars[1])[idx];
Field tmp1 = make_slice_iter<Field>(vars[0])[idx];
Field tmp0 = tmp1 * tmp2; // This multiplication is incorrectly optimized out
make_slice_iter<Field>(mut_vars[0])[idx] = tmp0;
}
// In wrapper.cu - Ground truth implementation (works correctly)
template <typename Field>
__launch_bounds__(256) __global__ void fused_arith_242(ConstPolyPtr const* vars, PolyPtr const* mut_vars, unsigned long long len) {
unsigned long long idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= len) return;
auto tmp2 = make_slice_iter<Field>(vars[1])[idx];
auto tmp1 = make_slice_iter<Field>(vars[0])[idx];
auto tmp0 = tmp1 * tmp2;
make_slice_iter<Field>(mut_vars[0])[idx] = tmp0;
}
// modulized implementation which triggers the bug
template <typename Field>
__launch_bounds__(256) __global__ void fused_arith_242(ConstPolyPtr const* vars, PolyPtr const* mut_vars, unsigned long long len) {
unsigned long long idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= len) return;
part0<Field>(
vars, mut_vars, idx
);
}
Issue Description: A complex optimization issue has been identified that depends on both template usage and code organization:
-
Code Organization Dependencies:
a. Working Cases:
- Non-template implementation
- Template implementation in a single file
- Template implementation with int type
b. Failing Case:
- Template implementation split across files
- Using bn254::Element as template parameter (large number computation)
-
Type Dependency:
- int type: Works in all cases
- bn254::Element: Fails with split template implementation
-
Optimization Behavior:
-
In split template implementation:
Field tmp1 = make_slice_iter<Field>(vars[0])[idx]; Field tmp0 = tmp1 * tmp2; // Incorrectly optimized out // when inputing with [1, 2, 3, 4] and [2, 3, 4, 5] (all in big number format) // the output is [1, 3, 4, 5] // but the truth is [2, 6, 12, 20]
-
-
Verification Methods: Working Solutions:
// Solution 1: Non-template implementation auto tmp1 = make_slice_iter<bn254::Element>(vars[0])[idx]; auto tmp0 = tmp1 * tmp2; // Solution 2: Add printf Field tmp1 = make_slice_iter<Field>(vars[0])[idx]; printf("tmp1: %d\n", tmp1.to_number().limbs[0]); Field tmp0 = tmp1 * tmp2;
Impact:
- Silent failure in cryptographic computations
- Affects only complex field types in split template implementations
- No compilation errors or runtime crashes
Workarounds:
- Keep template implementation in a single file
- Use non-template implementation
- Add printf statements to prevent optimization
I think this maybe a bug caused by nvcc? Or is it caused by other known issues?