Typename NumItemsT in cub::DeviceRadixSort::SortPairs()

Hello NVIDIA Developer Community,

I’ve encountered an unexpected behavior when using cub::DeviceRadixSort::SortPairs in CUDA. Specifically, the temporary storage bytes required (temp_storage_bytes) vary depending on whether I use int or size_t for the NumItemsT parameter, even though the value is the same.

Here’s a minimal example of my code:

#include <cub/cub.cuh>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0.0f, 10.0f);

    int    num_items1 = 5000;
    size_t num_items2 = num_items1;
    std::cout << "num_items1: " << num_items1 << std::endl
              << "num_items2: " << num_items2 << std::endl;
    float  *d_keys_in;
    float  *d_keys_out;
    float  *d_values_in;
    float  *d_values_out;

    cudaMalloc(&d_keys_in, num_items1 * sizeof(float));
    cudaMalloc(&d_keys_out, num_items1 * sizeof(float));
    cudaMalloc(&d_values_in, num_items1 * sizeof(float));
    cudaMalloc(&d_values_out, num_items1 * sizeof(float));
    std::vector<float> keys_in(num_items1);
    std::vector<float> keys_out(num_items1);
    std::vector<float> values_in(num_items1);
    std::vector<float> values_out(num_items1);
    for (int i = 0; i < num_items1; i++) {
        keys_in[i] = dis(gen);
        values_in[i] = i;
    cudaMemcpy(d_keys_in, keys_in.data(), num_items1 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_values_in, values_in.data(), num_items1 * sizeof(float), cudaMemcpyHostToDevice);

    void     *d_temp_storage1 = NULL;
    size_t   temp_storage_bytes1 = 0;
    cudaError_t err1 = cub::DeviceRadixSort::SortPairs(d_temp_storage1, temp_storage_bytes1,
                                                       d_keys_in, d_keys_out, d_values_in, d_values_out, num_items1);
    if (err1 != cudaSuccess) {
        std::cout << "Error1: " << cudaGetErrorString(err1) << std::endl;

    float *d_temp_storage2 = NULL;
    size_t temp_storage_bytes2 = 0;
    cudaError_t err2 = cub::DeviceRadixSort::SortPairs(d_temp_storage2, temp_storage_bytes2,
                                                       d_keys_in, d_keys_out, d_values_in, d_values_out, num_items2);
    if (err2 != cudaSuccess) {
        std::cout << "Error2: " << cudaGetErrorString(err2) << std::endl;

    std::cout << "temp_storage_bytes1: " << temp_storage_bytes1 << std::endl
              << "temp_storage_bytes2: " << temp_storage_bytes2 << std::endl;

    cudaMalloc(&d_temp_storage1, temp_storage_bytes1);
    cudaMalloc(&d_temp_storage2, temp_storage_bytes2);

    cudaError_t err = cub::DeviceRadixSort::SortPairs(d_temp_storage2, temp_storage_bytes1, // error here, because temp_storage_bytes1 < temp_storage_bytes2, even though I input the d_temp_storage2 with enough size
                                                      d_keys_in, d_keys_out, d_values_in, d_values_out, num_items2);
    if (err != cudaSuccess) {
        std::cout << "Error: " << cudaGetErrorString(err) << std::endl;

    cudaMemcpy(keys_out.data(), d_keys_out, num_items1 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(values_out.data(), d_values_out, num_items1 * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 20; i++) {
        std::cout << keys_out[i] << " " << values_out[i] << std::endl;


    return 0;

When I run this, I get different temp_storage_bytes values for int and size_t, and it seems to affect the actual sorting call. The output is:

nvcc -o test_cub ./test_sortpairs.cu


num_items1: 5000
num_items2: 5000
temp_storage_bytes1: 743167
temp_storage_bytes2: 1396479
Error: invalid argument
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0

My environment details:

  • NVCC version: 12.2
  • CUDA Version: 12.2
  • Driver Version: 535.113.01
  • GPU: NVIDIA GeForce RTX 4090
    My questions are:
  1. Why does the type of NumItemsT (i.e., int vs. size_t) result in different temp_storage_bytes values?
  2. Is the variation in temp_storage_bytes affecting the actual sorting call an intended behavior?

Any insights or guidance on this issue would be greatly appreciated.

Thank you!

The NumItemsT implies the size of the sort that can be done. A 32-bit NumItemsT means the sort can be optimized for sorting datasets where the indexing fits in 32 bits. A 64-bit NumItemsT probably means that various indexing variables will need to be twice as large. I would imagine that this has implications for sorting.

I would expect that it potentially also has an impact on performance, as some of the data to be moved during sort may be 64-bit vs. 32-bit data.

Certain aspects of the sort operation are probably impacted. For example a radix sort may use lower-level parallel primitives such as prefix sums. For a 32-bit data set size, the prefix sum can be a 32-bit prefix sum. For a 64-bit data set size, the prefix sum would probably have to be a 64 bit operation.

cub is open source, so any of these ideas can probably be verified or refuted by looking through the sort operation and seeing what gets affected by the type associated with NumItemsT.

If NumItemsT has no impact along these lines, and/or presented no optimization opportunity along these lines, there would probably be little reason to make it part of the template specification.

As I believe you already know, you can fix the indicated invalid argument error by passing temp_storage_bytes2 instead of temp_storage_bytes1 in the call, to match your usage of NumItemsT there.

1 Like

Thanks, Robert!