Different results for cub::DeviceSelect::If

Hi all,

I’m encountering different behavior of this test code, depending on the platform that it runs:

#include <cub/device/device_select.cuh>
#include <cub/iterator/counting_input_iterator.cuh>


struct LessThan {
   int compare;

   __host__ __device__ __forceinline__
   LessThan(int compare): compare(compare) {}

   __host__ __device__ __forceinline__
   bool operator()(const int &a) const {
      return (a < compare);
   }
};

__global__ void set_num_selected_out (int *x) {
   *x = 1234;
}

int main (int argc, char *argv[]) {
   int num_items = 8;
   int h_in[num_items] = {0, 2, 3, 9, 5, 2, 81, 8};
   int *d_in;
   cudaMalloc((void**)&d_in, num_items * sizeof(int));
   cudaMemcpy(d_in, h_in, num_items * sizeof(int), cudaMemcpyHostToDevice);
   int *d_out;
   cudaMalloc((void**)&d_out, num_items * sizeof(int));
   int *d_num_selected_out;
   cudaMalloc((void**)&d_num_selected_out, sizeof(int));
   LessThan select_op(7);

   void *d_temp_storage = NULL;
   size_t temp_storage_bytes = 0;

   cub::DeviceSelect::If(
     d_temp_storage, temp_storage_bytes,
     d_in, d_out, d_num_selected_out, num_items, select_op);

   printf ("Error: %s\n", cudaGetErrorString(cudaGetLastError()));
   printf ("temp_storage_bytes: %d\n", temp_storage_bytes);
} 

I have two test systems: An x86 host with A100 GPUs and a Grace-Hopper(H100) system. On the first one, I use the HPC SDK 24.3 module (NVIDIA HPC SDK 24.3 Release | NVIDIA Developer). I compile with

nvcc test_cub.cu -o test.x`

When I run this I get

Error: no error
temp_storage_bytes: 767

So far so good. On the Grace-Hopper system, I download the SDK from the same location as above, but obviously the ARM version. I compile it in the same way and get

Error: no error
temp_storage_bytes: 0

I tried some previous SDK versions, and at least with version 23.3, the results agree. So it’s nothing on GH100 that’s making trouble per se, maybe it’s a regression?

why is anything making trouble?

It seems plausible to me that in 23.3 timeframe there was no difference in the implementation, and in 24.3 there was.

Let me be more precise. The issue is that version 23.3 behaves identically for both A100 and GH100, but in version 24.3, the results do not match:

SDK Version A100 GH100
23.3 767 767
24.3 767 0

The results should be identical for both platforms.

I don’t know how you reached that conclusion.

apart from the issue with temp storage, if you run an actual cub::DeviceSelect::If operation (lets say on 24.3, on H100), does the selection operation work, or not?

The results should be the same because they do the same thing: Filtering all the elements of the array which are less than 7. The space required for storing these elements should not be zero.

Unless there is a difference in cub implementation when it is running on an A100 vs. when it is running on a H100. Which is why I asked about the thing that matters: if the cub operation (not the temp space calculation) actually works, or not.

I have added the subsequent call which actually filters out the array to this example:

   cudaMalloc((void**)&d_temp_storage, temp_storage_bytes);

   cub::DeviceSelect::If (d_temp_storage, temp_storage_bytes, d_in, d_out,
                          d_num_selected_out, num_items, select_op);

   int h_num_selected_out;
   cudaMemcpy(&h_num_selected_out, d_num_selected_out, sizeof(int), cudaMemcpyDeviceToHost);
   printf ("num_selected_out: %d\n", h_num_selected_out);

On A100, the result is as expected (five numbers are smaller than seven):

Error: no error
temp_storage_bytes: 767
num_selected_out: 5

On GH100, the result is 0:

Error: no error
temp_storage_bytes: 0
num_selected_out: 0

This definitely shouldn’t be like that, or am I mistaken?

No I don’t think that should happen. My suggestion is to file a bug.

1 Like

The solution is that CUB failed to work due to a deprecated driver version. compute-sanitizer revealed

the provided PTX was compiled with an unsupported toolchain

With -sm=arch_90, the output is in agreement with the A100 system.
I have checked for the CUDA error code in the test program, but cudaGetLastError is in use by CUB. To check the error value properly, you need to use the return value of that function.