CUB ReduceByKey unexpected execution times

Just trying to learn CUB, and to see how it performs under different loads for a future project.

I’m measuring the execution time of the DeviceReduce::ReduceByKey method, but the times I’m getting seem strange. I probably have an error in how I measure.

The first call, which only returns the amount of temporary storage that needs to be allocated, costs ~10x as much as the actual reduction. Is that normal?
Also, the execution time hardly changes with data volume.

On 180M entries I get 0.927ms for the mem requirement run and 0.092ms for the reduction.
On 18M items it’s 0.656ms for the mem requirement and 0.079ms for the reduction.
On 190 items it’s 0.583ms for the mem requirement and 0.066ms for the reduction.

I’m just adding my code to a CUDA Samples project (simpleGL), so I’m using the SDK’s StopWatchTimer… perhaps incorrectly?

  StopWatchInterface *stopWatch = NULL;
  
  sdkCreateTimer(&stopWatch);
  sdkStartTimer(&stopWatch);
  
  // Get mem requirement
  CubDebugExit(DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items));
  
  sdkStopTimer(&stopWatch);
  float kernelTime = sdkGetTimerValue(&stopWatch);

  CubDebugExit(g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes));

  printf("DeviceReduce %zd bytes allocated..\ntime %fms\n", temp_storage_bytes, kernelTime);

  sdkResetTimer(&stopWatch);
  sdkStartTimer(&stopWatch);

  // Run reduce-by-key
  CubDebugExit(DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items));

  sdkStopTimer(&stopWatch);

  kernelTime = sdkGetTimerValue(&stopWatch);

  // later
  printf("\ntime %fms\nkeys: %d\n", kernelTime, keyCount);

I’m not sure what you find strange. Smaller sizes have smaller execution times. The lack of purely linear scaling probably implies some fixed overheads.

You might wish to learn to use a profiler such as Nsight Systems.

Yes, device memory allocations can take much, much longer than a CUDA kernel run. To avoid this, run the test multiple times, reusing the previous allocation.
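
To make that concrete, here is a rough sketch of the pattern (not standalone code: it assumes the d_keys_in, d_values_in, etc. pointers and reduction_op from the snippet above, and cudaEvent timing is just one option). The size query and allocation happen once up front, and the timed reduction is repeated while reusing the same temp storage; the event synchronize makes sure the asynchronous call has actually finished before the elapsed time is read:

  // Size query and a single allocation, done once up front (not timed)
  cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Repeat the reduction, reusing the same temp storage.
  // The first iteration acts as a warm-up; later iterations give the best-case estimate.
  for (int rep = 0; rep < 5; rep++) {
    cudaEventRecord(start);
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for the asynchronous reduction to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("rep %d: %f ms\n", rep, ms);
  }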

Thanks for your response!

The mem alloc was excluded from the timing if you look at the code I posted: I stopped the timer before it and reset it afterwards. Unless I made a mistake there…?

Anyway, the first time value must have included some initialization.
I added a very low-cost dummy operation at the beginning - that now takes ~1ms.
The same mem requirement call is now less than 0.01ms. Looks way better.

And yes, the scaling was very surprising.
Is there a way to improve performance for smaller datasets?
Could using block-wide functions help with this? Or telling CUDA the max threads per block or the minimum block count I’ll be using?

CUDA has various overheads. To remove all of these, I suggest running a function twice in order to determine a best-case perf estimate.

When I do a careful job, for 190 items on CUDA 12.2 on an L4 GPU, I observe about 22 microseconds. If you are expecting something much better than 22 microseconds, a GPU may not meet your needs; the overhead of an empty kernel launch alone is about 5 microseconds.
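
If you want to sanity-check that launch-overhead figure on your own card, a rough sketch (an illustration only, not part of the test below) is to time a batch of empty kernel launches and take the average:

#include <cstdio>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void empty_kernel() { }

int main(){
  const int reps = 1000;
  empty_kernel<<<1,1>>>();              // warm-up launch, absorbs one-time initialization
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  for (int i = 0; i < reps; i++) empty_kernel<<<1,1>>>();
  cudaDeviceSynchronize();              // wait for all launches to complete before reading the timer
  dt = dtime_usec(dt);
  printf("average cost per empty launch: ~%f us\n", dt/(double)reps);
}

The exact number will vary with GPU and driver, and with whether the launches are issued back-to-back as here or individually synchronized.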

If you have many smaller datasets that all need to be reduced at the same time, you could use a block-wide reduction, giving each block one dataset to reduce (a sketch of that follows below). With reduce-by-key, however, the same thing may also be achievable simply by concatenating the datasets appropriately and running a single reduce-by-key.
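
As a rough illustration of the block-reduce idea (a sketch only: the min operator, the 128-thread block size, the per_dataset_min kernel name, and the assumption that every dataset has the same length are all placeholders for whatever your real workload needs):

#include <cub/cub.cuh>
#include <climits>
#include <cstdio>

// One thread block per dataset; each block computes the minimum of its own dataset.
template <int BLOCK_THREADS>
__global__ void per_dataset_min(const int *data, int *results, int items_per_set)
{
  using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  const int *my_set = data + (size_t)blockIdx.x * items_per_set;

  // Each thread strides through its block's dataset and keeps a running minimum.
  int thread_min = INT_MAX;
  for (int i = threadIdx.x; i < items_per_set; i += BLOCK_THREADS)
    thread_min = min(thread_min, my_set[i]);

  // Block-wide reduction of the per-thread partial results.
  int block_min = BlockReduce(temp_storage).Reduce(thread_min, cub::Min());
  if (threadIdx.x == 0) results[blockIdx.x] = block_min;
}

int main(){
  const int num_sets = 1000, items_per_set = 190;
  int *d_data, *d_results;
  cudaMalloc(&d_data, (size_t)num_sets*items_per_set*sizeof(int));
  cudaMalloc(&d_results, num_sets*sizeof(int));
  cudaMemset(d_data, 0, (size_t)num_sets*items_per_set*sizeof(int));  // dummy data

  per_dataset_min<128><<<num_sets, 128>>>(d_data, d_results, items_per_set);
  cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
}

Whether that beats a single concatenated reduce-by-key depends on the sizes involved, so it is worth measuring both. The full reduce-by-key timing test referred to above follows: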

# cat t129.cu
#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// CustomMin functor
  struct CustomMin
  {
    template <typename T>
    __host__ __device__  __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
  };
int main(int argc, char *argv[]){

// Declare, allocate, and initialize device-accessible pointers for input and output
  int          num_items = 190;          // e.g., 8
  if (argc > 1) num_items = atoi(argv[1]);
  int          *d_keys_in;         // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
  int          *d_values_in;       // e.g., [0, 7, 1, 6, 2, 5, 3, 4]
  int          *d_unique_out;      // e.g., [-, -, -, -, -, -, -, -]
  int          *d_aggregates_out;  // e.g., [-, -, -, -, -, -, -, -]
  int          *d_num_runs_out;    // e.g., [-]
  cudaMalloc(&d_keys_in, num_items*sizeof(*d_keys_in));
  cudaMalloc(&d_values_in, num_items*sizeof(*d_values_in));
  cudaMalloc(&d_unique_out, num_items*sizeof(*d_unique_out));
  cudaMalloc(&d_aggregates_out, num_items*sizeof(*d_aggregates_out));
  cudaMalloc(&d_num_runs_out, sizeof(*d_num_runs_out));
  CustomMin    reduction_op;
  // Determine temporary device storage requirements
  void     *d_temp_storage = NULL;
  size_t   temp_storage_bytes = 0;
  cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
  // Allocate temporary storage
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // warm-up Run reduce-by-key
  cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
// d_unique_out      <-- [0, 2, 9, 5, 8]
// d_aggregates_out  <-- [0, 1, 6, 2, 4]
// d_num_runs_out    <-- [5]
  int *data = new int[num_items];
  for (int i = 0; i < num_items; i++) data[i] = rand()%10;
  cudaMemcpy(d_values_in, data, num_items*sizeof(*d_values_in), cudaMemcpyHostToDevice);
  for (int i = 0; i < num_items; i++) data[i] = i/5;
  cudaMemcpy(d_keys_in, data, num_items*sizeof(*d_keys_in), cudaMemcpyHostToDevice);
  unsigned long long dt = dtime_usec(0);
  cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) std::cout << "Elapsed time: " << dt/(float)USECPSEC << "s" << std::endl;
  else std::cout << "Error: " << cudaGetErrorString(err) << std::endl;
}

# nvcc -o t129 t129.cu
# ./t129
Elapsed time: 2.2e-05s
# ./t129 10000
Elapsed time: 2.4e-05s
# ./t129 100000
Elapsed time: 7.4e-05s
#

I’m not able to reconcile my perf data with yours. Using the test above and your sizes, I get 22 microseconds for 190 elements and 840 microseconds for 18M elements. Your result of 79 microseconds for 18M elements seems pretty good to me, perhaps your GPU is faster than my L4.

Thanks for your effort!

This is concerning; it seems I’m messing up the time measurements quite a bit… by an order of magnitude in both directions.

I couldn’t find any comparisons since the L4 is so new, and I don’t know much about hardware anyway - but mine is just a gaming card (4070 Ti). I guess yours should be much faster, but at least not slower.
Also, your 22 microseconds for 190 items makes much more sense relative to the rest than what I saw.

I’ll try to see what’s wrong with my timer… maybe I should use the OS APIs instead.