I’m experiencing a weird bug where a kernel returns incorrect results with newer versions of CUDA (12.6+), but correct results with older versions (CUDA <= 12.2). With CUDA 12.6+, the issue disappears if I compile in debug mode with the flags -Xptxas -O0, which suggests a compiler optimization issue.
I have created a minimal reproducer in the GitHub repository nrbertin/kernel-bug. The code is based on Kokkos and relies on placing an object in shared memory.
The reproducer contains 2 kernels in a functor: Kernel 1 with Tag1 and Kernel 2 with Tag2. Both kernels are identical but call different versions of a bar method:
struct Functor
{
    Foo* foo;
    Functor(Foo* _foo) : foo(_foo) {}

    // Kernel 1 calling bar1
    KOKKOS_INLINE_FUNCTION
    void operator() (Foo::Tag1, int i, int& sum) const
    {
        int id = foo->get_id(i);
        int bar = foo->bar1(id);
        sum += abs(bar);
    }

    // Kernel 2 calling bar2
    KOKKOS_INLINE_FUNCTION
    void operator() (Foo::Tag2, int i, int& sum) const
    {
        int id = foo->get_id(i);
        int bar = foo->bar2(id);
        sum += abs(bar);
    }
};
bar1 and bar2 are identical; the only difference is that inlining is explicitly disabled for bar1:
class Foo {
public:
    Foo() {}
    int maxid = 50;

    KOKKOS_INLINE_FUNCTION
    int get_id(const int& i) const
    {
        return max(min(i, maxid), 0);
    }

    __noinline__
    KOKKOS_INLINE_FUNCTION
    int bar1(const int& id) const
    {
        return (int)(id == maxid);
    }

    KOKKOS_INLINE_FUNCTION
    int bar2(const int& id) const
    {
        return (int)(id == maxid);
    }

    struct Tag1 {};
    struct Tag2 {};

    template<class Tag>
    int execute();
};
The 2 kernels compute exactly the same thing: they 1) clamp the index i to [0, 50] in get_id(), and 2) check whether the clamped index id is equal to maxid = 50. When launching the 2 kernels (Tag1 and Tag2) over the range i = [0, 9]:
int N = 10;
int sum = 0;
Kokkos::parallel_reduce(
    Kokkos::RangePolicy<Tag>(0, N),
    Functor(this), sum
);
Kokkos::fence();
the correct sum is 0, since all i values are below maxid = 50 and the comparison is therefore never true. This is indeed what I obtain with CUDA <= 12.2:
(base) bash-4.4$ ./kernel_bug
Kernel Tag1: 0
Kernel Tag2: 0
However, with CUDA 12.6+ (I tested with 12.6 and 12.9), the inlined Kernel 2 returns the incorrect result of 10:
(base) bash-4.4$ ./kernel_bug
Kernel Tag1: 0
Kernel Tag2: 10
as if the instruction return (int)(id == maxid); in bar2 were incorrectly optimized to always return 1, without performing the runtime comparison.