Regarding TempStorage in CUB library functions in CUDA

Hi everyone,
I am trying to understand how to use the CUB library functions efficiently. Consider the following example, which computes per-warp sums using a warp-level reduction.

#include <cub/cub.cuh>
#include <cuda.h>
#include <vector>
using std::vector;
#include <iostream>
using std::cout;
using std::endl;

const int N = 128;

__global__ void sum(float *indata, float *outdata) {
    typedef cub::WarpReduce<float> WarpReduce;
    // One TempStorage object per warp: 128 threads / 32 threads per warp = 4
    __shared__ typename WarpReduce::TempStorage temp_storage[4];
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    // Index the warp within the block (not globally): shared memory is per block
    int warp_id = threadIdx.x/32;
    if( id < N ) {
        outdata[id] = WarpReduce(temp_storage[warp_id]).Sum(indata[id]);
    }
}

int main() {
    vector<float> y(N), sol(N);
    float *dev_y, *dev_sol;
    for( int i = 0; i < N; i++ ) {
        y[i] = (float)i;
    }
    cudaMalloc(&dev_y, N*sizeof(float));
    cudaMalloc(&dev_sol, N*sizeof(float));
    cudaMemcpy(dev_y, y.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    sum<<<1, N>>>(dev_y, dev_sol);
    cudaMemcpy(sol.data(), dev_sol, N*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "input: ";
    for( int i = 0; i < N; i++ ) cout << y[i] << "\t";
    cout << endl;
    cout << "output: ";
    for( int i = 0; i < N; i++ ) cout << sol[i] << "\t";
    cout << endl;
    cudaFree(dev_y);
    cudaFree(dev_sol);
    return 0;
}

My question: what happens when temp_storage[4] is decreased/changed to temp_storage[1]? (I am getting the correct answer in both cases.) And how do I identify the optimum temp_storage size?

You need one TempStorage object per warp. That is how CUB must be used. Since you are using the default 32 threads per warp for WarpReduce, you need at least 128 / 32 = 4 TempStorage objects. There is no other "optimal" array size for temp_storage.

That aside, the amount of shared memory actually occupied depends on the GPU architecture and on the number of threads per logical "warp" used for the reduction. In some configurations no shared memory is used at all for the reduction (it is done entirely with warp shuffles), which would explain why it "works" in your case with any array size of temp_storage.
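For reference, the logical warp size is the second template parameter of cub::WarpReduce, and the one-TempStorage-per-warp rule applies to logical warps. A hedged sketch of the same kernel with 16-thread logical warps (assuming the same N = 128 and a single 128-thread block):

```
__global__ void sum16(float *indata, float *outdata) {
    // Logical warps of 16 threads: 128 / 16 = 8 TempStorage objects.
    typedef cub::WarpReduce<float, 16> WarpReduce16;
    __shared__ typename WarpReduce16::TempStorage temp_storage[8];
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x/16;  // logical warp index within the block
    if( id < N ) {
        outdata[id] = WarpReduce16(temp_storage[warp_id]).Sum(indata[id]);
    }
}
```

On architectures where CUB can implement the sub-warp reduction purely with shuffle instructions, the TempStorage may go unused, which is exactly why an undersized array can appear to work.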

