Different results for cub::DeviceSelect::If

christian.weiss · April 3, 2024, 12:51pm

Hi all,

I’m encountering different behavior of this test code, depending on the platform that it runs:

#include <cstdio>
#include <cub/device/device_select.cuh>
#include <cub/iterator/counting_input_iterator.cuh>


struct LessThan {
   int compare;

   __host__ __device__ __forceinline__
   LessThan(int compare): compare(compare) {}

   __host__ __device__ __forceinline__
   bool operator()(const int &a) const {
      return (a < compare);
   }
};

__global__ void set_num_selected_out (int *x) {
   *x = 1234;
}

int main (int argc, char *argv[]) {
   const int num_items = 8;
   int h_in[num_items] = {0, 2, 3, 9, 5, 2, 81, 8};
   int *d_in;
   cudaMalloc((void**)&d_in, num_items * sizeof(int));
   cudaMemcpy(d_in, h_in, num_items * sizeof(int), cudaMemcpyHostToDevice);
   int *d_out;
   cudaMalloc((void**)&d_out, num_items * sizeof(int));
   int *d_num_selected_out;
   cudaMalloc((void**)&d_num_selected_out, sizeof(int));
   LessThan select_op(7);

   void *d_temp_storage = NULL;
   size_t temp_storage_bytes = 0;

   cub::DeviceSelect::If(
     d_temp_storage, temp_storage_bytes,
     d_in, d_out, d_num_selected_out, num_items, select_op);

   printf ("Error: %s\n", cudaGetErrorString(cudaGetLastError()));
   printf ("temp_storage_bytes: %zu\n", temp_storage_bytes);
}

I have two test systems: an x86 host with A100 GPUs and a Grace-Hopper (H100) system. On the first one, I use the NVIDIA HPC SDK 24.3 module. I compile with

nvcc test_cub.cu -o test.x

When I run this I get

Error: no error
temp_storage_bytes: 767

So far so good. On the Grace-Hopper system, I downloaded the ARM version of the SDK from the same location. I compile the test in the same way and get

Error: no error
temp_storage_bytes: 0

I tried some previous SDK versions, and at least with version 23.3, the results agree. So the GH100 itself does not seem to be the problem. Maybe it’s a regression?

Robert_Crovella · April 3, 2024, 3:04pm

why is anything making trouble?

It seems plausible to me that in 23.3 timeframe there was no difference in the implementation, and in 24.3 there was.

christian.weiss · April 4, 2024, 6:30am

Let me be more precise. The issue is that version 23.3 behaves identically for both A100 and GH100, but in version 24.3, the results do not match:

SDK Version	A100	GH100
23.3	767	767
24.3	767	0

The results should be identical for both platforms.

Robert_Crovella · April 4, 2024, 2:03pm

I don’t know how you reached that conclusion.

Apart from the issue with temp storage, if you run an actual cub::DeviceSelect::If operation (let’s say on 24.3, on H100), does the selection operation work, or not?

christian.weiss · April 4, 2024, 2:08pm

The results should be the same because both runs do the same thing: filtering out all the elements of the array that are less than 7. The space required for storing these elements should not be zero.

Robert_Crovella · April 4, 2024, 2:32pm

Unless there is a difference in cub implementation when it is running on an A100 vs. when it is running on a H100. Which is why I asked about the thing that matters: if the cub operation (not the temp space calculation) actually works, or not.

christian.weiss · April 5, 2024, 11:11am

I have added the subsequent call which actually filters out the array to this example:

   cudaMalloc((void**)&d_temp_storage, temp_storage_bytes);

   cub::DeviceSelect::If (d_temp_storage, temp_storage_bytes, d_in, d_out,
                          d_num_selected_out, num_items, select_op);

   int h_num_selected_out;
   cudaMemcpy(&h_num_selected_out, d_num_selected_out, sizeof(int), cudaMemcpyDeviceToHost);
   printf ("num_selected_out: %d\n", h_num_selected_out);

On A100, the result is as expected (five numbers are smaller than seven):

Error: no error
temp_storage_bytes: 767
num_selected_out: 5

On GH100, the result is 0:

Error: no error
temp_storage_bytes: 0
num_selected_out: 0

This definitely shouldn’t be like that, or am I mistaken?

Robert_Crovella · April 5, 2024, 2:34pm

No, I don’t think that should happen. My suggestion is to file a bug.

christian.weiss · April 15, 2024, 9:57am

The solution: CUB failed to work because the installed driver was too old for the toolkit. compute-sanitizer revealed

the provided PTX was compiled with an unsupported toolchain

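With -arch=sm_90, the binary contains native code for the H100, so no PTX has to be JIT-compiled by the driver, and the output agrees with the A100 system.
I had checked the CUDA error code in the test program, but CUB calls cudaGetLastError internally, which resets the error state before my own call sees it. To check for errors properly, you need to use the cudaError_t return value of cub::DeviceSelect::If itself.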
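
For reference, here is a minimal sketch of the test with the return values checked. CHECK_CUDA is a hypothetical helper macro introduced only for this example (it is not part of CUB or the CUDA runtime), and the compile line in the first comment assumes an H100 target. With a mismatched driver and toolkit, the first cub::DeviceSelect::If call should then report the failure instead of silently leaving temp_storage_bytes at 0.

```cuda
// Assumed compile line for an H100 target: nvcc -arch=sm_90 test_cub.cu -o test.x
#include <cstdio>
#include <cstdlib>
#include <cub/device/device_select.cuh>

// Hypothetical helper (not part of CUB or the CUDA runtime): abort with a
// message if an expression returning cudaError_t does not yield cudaSuccess.
#define CHECK_CUDA(call)                                              \
   do {                                                               \
      cudaError_t err_ = (call);                                      \
      if (err_ != cudaSuccess) {                                      \
         fprintf(stderr, "%s:%d: CUDA error: %s\n", __FILE__,         \
                 __LINE__, cudaGetErrorString(err_));                 \
         exit(EXIT_FAILURE);                                          \
      }                                                               \
   } while (0)

struct LessThan {
   int compare;

   __host__ __device__ explicit LessThan(int compare) : compare(compare) {}

   __host__ __device__ bool operator()(const int &a) const {
      return a < compare;
   }
};

int main() {
   const int num_items = 8;
   const int h_in[num_items] = {0, 2, 3, 9, 5, 2, 81, 8};

   int *d_in = nullptr;
   int *d_out = nullptr;
   int *d_num_selected_out = nullptr;
   CHECK_CUDA(cudaMalloc(&d_in, num_items * sizeof(int)));
   CHECK_CUDA(cudaMalloc(&d_out, num_items * sizeof(int)));
   CHECK_CUDA(cudaMalloc(&d_num_selected_out, sizeof(int)));
   CHECK_CUDA(cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice));

   LessThan select_op(7);
   void *d_temp_storage = nullptr;
   size_t temp_storage_bytes = 0;

   // First call: d_temp_storage is null, so CUB only reports the required
   // temporary storage size. The returned cudaError_t is checked directly
   // rather than relying on a later cudaGetLastError().
   CHECK_CUDA(cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, d_num_selected_out,
                                    num_items, select_op));
   printf("temp_storage_bytes: %zu\n", temp_storage_bytes);

   CHECK_CUDA(cudaMalloc(&d_temp_storage, temp_storage_bytes));

   // Second call: performs the actual selection.
   CHECK_CUDA(cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, d_num_selected_out,
                                    num_items, select_op));

   int h_num_selected_out = 0;
   CHECK_CUDA(cudaMemcpy(&h_num_selected_out, d_num_selected_out,
                         sizeof(int), cudaMemcpyDeviceToHost));
   printf("num_selected_out: %d\n", h_num_selected_out);

   CHECK_CUDA(cudaFree(d_temp_storage));
   CHECK_CUDA(cudaFree(d_num_selected_out));
   CHECK_CUDA(cudaFree(d_out));
   CHECK_CUDA(cudaFree(d_in));
   return 0;
}
```

Wrapping every runtime and CUB call this way makes the failure show up at the call that caused it, which is easier to diagnose than a zero result further down.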
