Weird bug involving the way to pass parameters to kernels

__global__ void kTest(GridInfo gridInfo, int3 test, unsigned int* res, bool cond) {
  *res = getCellIdx(gridInfo, test.x, test.y, test.z, cond);
}

void test(GridInfo gridInfo) {
  int3 test = make_int3(283, 10, 418);

  unsigned int* d_res;
  cudaMalloc(reinterpret_cast<void**>(&d_res), sizeof(unsigned int));

  kTest<<<1, 1>>>(gridInfo, test, d_res, true);
  unsigned int h_res;
  cudaMemcpy(reinterpret_cast<void*>(&h_res), d_res, sizeof(unsigned int),
             cudaMemcpyDeviceToHost);
  printf("%d\n", h_res);
}

I encountered a weird bug involving passing parameters to kernel functions. kTest is the kernel, and its last parameter is a bool. kTest calls a __device__ function getCellIdx, and that bool is simply forwarded to getCellIdx as its last argument.

What I found is that if I pass true to the kernel (as in the code above), the result is incorrect. But if I hard-code the last argument at the getCellIdx call site instead (i.e., *res = getCellIdx(gridInfo, test.x, test.y, test.z, true);), I get the correct result. Functionally the two should be equivalent, right? Or am I missing something here?
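For reference, this is the variant that gives the correct result on the problematic machine; the only change from the kernel above is that the last argument is hard-coded instead of taking the value of cond:

__global__ void kTest(GridInfo gridInfo, int3 test, unsigned int* res, bool cond) {
  // identical to the kernel above, except that cond is ignored
  // and true is passed directly to the device function
  *res = getCellIdx(gridInfo, test.x, test.y, test.z, true);
}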

I have this bug on an RTX 2080Ti GPU, nvcc V10.0.130, Driver Version: 470.42.01 and CUDA Version: 11.4, but didn’t have this issue on another machine with a RTX 2080, nvcc V11.3.109, Driver Version: 465.19.01 and CUDA Version: 11.3.

Yes, they look equivalent to me.

It’s possible it’s fixed with a newer CUDA version. It’s impossible to be sure, though, because you haven’t indicated how you are compiling.

If you want further help, my suggestion would be to provide a complete test case. It doesn’t look like it would require much more code than what you have shown already.

Technically, %d isn’t the right format string for an unsigned quantity, but that doesn’t seem likely to be the issue here.

The rest of the code is quite involved, but I’ve pasted the relevant parts here. I know there are a few uint/int mix-ups, and I can fix them, but the numbers involved are quite small and all positive, so that won’t be an issue. I’ve also seen cases where passing an int3 to a kernel function gives a different result than passing three ints and calling make_int3 inside the kernel.
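To make the int3 observation concrete, the two variants I mean look roughly like this (kTestA and kTestB are made-up names just for illustration, not the kernels from my project):

// variant A: pass the int3 to the kernel by value
__global__ void kTestA(GridInfo gridInfo, int3 test, unsigned int* res, bool cond) {
  *res = getCellIdx(gridInfo, test.x, test.y, test.z, cond);
}

// variant B: pass three ints and rebuild the int3 inside the kernel
__global__ void kTestB(GridInfo gridInfo, int x, int y, int z, unsigned int* res, bool cond) {
  int3 test = make_int3(x, y, z);
  *res = getCellIdx(gridInfo, test.x, test.y, test.z, cond);
}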

I should also say that this weird bug goes away when I add a printf in the kernel, even though there is absolutely no shared data across threads and no synchronization involved.
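By “a printf in the kernel” I mean roughly the following; the exact statement and its placement are only illustrative. Since the launch is <<<1, 1>>>, there is only a single thread anyway:

__global__ void kTest(GridInfo gridInfo, int3 test, unsigned int* res, bool cond) {
  *res = getCellIdx(gridInfo, test.x, test.y, test.z, cond);
  printf("%u\n", *res);  // with a printf like this in the kernel, the bug disappears
}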

struct GridInfo
{
  float3 GridMin;
  unsigned int ParticleCount;
  float3 GridDelta;
  uint3 GridDimension;
  uint3 MetaGridDimension;
  unsigned int meta_grid_dim;
  unsigned int meta_grid_size;
};

__host__ __device__ inline uint Part1By2(uint x)
{
  x &= 0x000003ff;                  // x = ---- ---- ---- ---- ---- --98 7654 3210
  x = (x ^ (x << 16)) & 0xff0000ff; // x = ---- --98 ---- ---- ---- ---- 7654 3210
  x = (x ^ (x << 8)) & 0x0300f00f;  // x = ---- --98 ---- ---- 7654 ---- ---- 3210
  x = (x ^ (x << 4)) & 0x030c30c3;  // x = ---- --98 ---- 76-- --54 ---- 32-- --10
  x = (x ^ (x << 2)) & 0x09249249;  // x = ---- 9--8 --7- -6-- 5--4 --3- -2-- 1--0
  return x;
}

__host__ __device__ inline uint MortonCode3(uint x, uint y, uint z)
{
  return (Part1By2(z) << 2) + (Part1By2(y) << 1) + Part1By2(x);
}

__host__ __device__ inline uint CellIndicesToLinearIndex(
  const uint3 &cellDimensions,
  const uint3 &xyz
)
{
  return xyz.z * cellDimensions.y * cellDimensions.x + xyz.y * cellDimensions.x + xyz.x;
}

inline __host__ __device__ uint ToCellIndex_MortonMetaGrid(const GridInfo &GridInfo, int3 gridCell)
{
  int3 metaGridCell = make_int3(
    gridCell.x / GridInfo.meta_grid_dim,
    gridCell.y / GridInfo.meta_grid_dim,
    gridCell.z / GridInfo.meta_grid_dim);

  gridCell.x %= GridInfo.meta_grid_dim;
  gridCell.y %= GridInfo.meta_grid_dim;
  gridCell.z %= GridInfo.meta_grid_dim;
  uint metaGridIndex = CellIndicesToLinearIndex(GridInfo.MetaGridDimension, metaGridCell);

  return metaGridIndex * GridInfo.meta_grid_size + MortonCode3(gridCell.x, gridCell.y, gridCell.z);
}

inline __host__ __device__
unsigned int getCellIdx(GridInfo gridInfo, int ix, int iy, int iz, bool morton) {
  if (morton) // z-order sort
    return ToCellIndex_MortonMetaGrid(gridInfo, make_int3(ix, iy, iz));
  else // raster order
    return (ix * gridInfo.GridDimension.y + iy) * gridInfo.GridDimension.z + iz;
}

You still haven’t provided a complete code.

Again, if the issue is not reproducible with newer nvcc versions, then just switch to a newer version. Nobody is going to fix anything in nvcc from CUDA 10 at this point.

If you wish to provide a complete code, I’ll take a look as time permits.

If you’re not sure how to provide a complete code, then do this:

  • pretend you are me
  • start with a clean, empty project
  • add only the things that are actually posted in this question
  • compile the code
  • does it compile correctly? If not, keep adding things to this posting until the code in the posting compiles
  • once the code compiles, run it, and make sure the output gives me a way to understand if it is working correctly or not. If you need command line parameters, specify those. If you need input files, I probably won’t look at your code. Redesign the test case to eliminate the need for input files.

It’s preferable, of course, that you minimize the code, while still following the above steps, so as to reduce the code to the minimum necessary to see the issue.
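For example, a skeleton along these lines would qualify, assuming the struct and the helper functions you already posted live in the same file (and that the uint/int mix-ups you mentioned are fixed so it compiles). The GridInfo field values below are pure placeholders, since I don’t know what your application actually uses:

#include <cstdio>
#include <cuda_runtime.h>

// ... GridInfo, Part1By2, MortonCode3, CellIndicesToLinearIndex,
// ToCellIndex_MortonMetaGrid and getCellIdx exactly as posted above ...

__global__ void kTest(GridInfo gridInfo, int3 test, unsigned int* res, bool cond) {
  *res = getCellIdx(gridInfo, test.x, test.y, test.z, cond);
}

int main() {
  GridInfo gridInfo = {};                                 // placeholder values only
  gridInfo.GridDimension     = make_uint3(512, 512, 512);
  gridInfo.MetaGridDimension = make_uint3(64, 64, 64);
  gridInfo.meta_grid_dim     = 8;
  gridInfo.meta_grid_size    = 8 * 8 * 8;

  int3 test = make_int3(283, 10, 418);

  unsigned int* d_res;
  cudaMalloc(reinterpret_cast<void**>(&d_res), sizeof(unsigned int));

  kTest<<<1, 1>>>(gridInfo, test, d_res, true);

  unsigned int h_res;
  cudaMemcpy(&h_res, d_res, sizeof(unsigned int), cudaMemcpyDeviceToHost);

  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));

  // every helper is __host__ __device__, so the host can compute the expected value
  unsigned int ref = getCellIdx(gridInfo, test.x, test.y, test.z, true);
  printf("GPU: %u  CPU: %u  %s\n", h_res, ref, h_res == ref ? "OK" : "MISMATCH");
  return 0;
}

Something self-contained like that is what makes it possible for someone else to reproduce (or fail to reproduce) the problem.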

Again, these are just suggestions. Do as you wish.