Kernel returns incorrect result with CUDA 12.6+

bertin1 · October 4, 2025, 2:41am

I’m experiencing a weird bug where a kernel returns incorrect results with newer versions of CUDA (12.6+), but correct results with older versions (CUDA <= 12.2). With CUDA 12.6+, the issue disappears if I compile in debug mode with the flags -Xptxas -O0, which suggests a compiler optimization issue.

I have created a minimal reproducer on GitHub: nrbertin/kernel-bug. The code is based on Kokkos and relies on placing an object in shared memory.

The reproducer contains two kernels in a functor: Kernel 1 with Tag1 and Kernel 2 with Tag2. The kernels are identical except that they call different versions of a bar method:

struct Functor
{
    Foo* foo;
    
    Functor(Foo* _foo) : foo(_foo) {}

    // Kernel 1 calling bar1
    KOKKOS_INLINE_FUNCTION
    void operator() (Foo::Tag1, int i, int& sum) const
    {
        int id = foo->get_id(i);
        int bar = foo->bar1(id);
        sum += abs(bar);
    }

    // Kernel 2 calling bar2
    KOKKOS_INLINE_FUNCTION
    void operator() (Foo::Tag2, int i, int& sum) const
    {
        int id = foo->get_id(i);
        int bar = foo->bar2(id);
        sum += abs(bar);
    }  
};

bar1 and bar2 are identical; the only difference is that inlining is explicitly disabled for bar1:

class Foo {
public:
    Foo() {}
    
    int maxid = 50;

    KOKKOS_INLINE_FUNCTION
    int get_id(const int& i) const
    {
        return max(min(i, maxid), 0);
    }

    __noinline__
    KOKKOS_INLINE_FUNCTION
    int bar1(const int& id) const
    {
        return (int)(id == maxid);
    }
    
    KOKKOS_INLINE_FUNCTION
    int bar2(const int& id) const
    {
        return (int)(id == maxid);
    }

    struct Tag1 {};
    struct Tag2 {};

    template<class Tag>
    int execute();
};

The two kernels compute exactly the same thing: 1) clamp the index i to [0, 50] in get_id(), and 2) check whether the clamped index id is equal to maxid = 50. When launching the two kernels, Tag1 and Tag2, over the range i = [0, 9]:

int N = 10;
int sum = 0;
Kokkos::parallel_reduce(
    Kokkos::RangePolicy<Tag>(0, N),
    Functor(this), sum
);
Kokkos::fence();

the correct sum is obviously 0, since all i values are below maxid = 50. This is indeed what I obtain with CUDA <= 12.2:

(base) bash-4.4$ ./kernel_bug 
Kernel Tag1: 0
Kernel Tag2: 0

However, with CUDA 12.6+ (I tested with 12.6 and 12.9), the inlined Kernel 2 returns the incorrect result of 10:

(base) bash-4.4$ ./kernel_bug 
Kernel Tag1: 0
Kernel Tag2: 10

as if the statement return (int)(id == maxid); in bar2 were incorrectly optimized to always return 1, without performing the run-time comparison.
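
For reference, the expected value can be checked on the host without Kokkos or CUDA. The following is only a minimal sketch in plain C++: HostFoo and reference_sum are hypothetical names that do not appear in the reproducer, and the loop is a serial stand-in for the parallel_reduce, not the actual kernel.

```cpp
#include <algorithm>
#include <cstdlib>

// Host-only stand-in for the reproducer's Foo (hypothetical name).
struct HostFoo {
    int maxid = 50;

    // Clamp i to [0, maxid], mirroring get_id() in the reproducer.
    int get_id(int i) const {
        return std::max(std::min(i, maxid), 0);
    }

    // Same comparison as bar1/bar2: 1 if id equals maxid, otherwise 0.
    int bar(int id) const {
        return static_cast<int>(id == maxid);
    }
};

// Serial equivalent of the reduction over i in [0, n) (hypothetical helper).
int reference_sum(int n) {
    HostFoo foo;
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        const int id = foo.get_id(i);
        sum += std::abs(foo.bar(id));
    }
    return sum;
}
```

Calling reference_sum(10) returns 0, which matches the Kernel Tag1 output. A sum of 10 for N = 10 would require the comparison to be true for every i, which only happens once the clamped index has reached maxid.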
