Has anyone here used CUB in the past?

I’m looking into using CUB and parts of it seem weird. I’m attracted to it because it’s supposed to be faster than Thrust but it looks like a massive step down in terms of usability. Here’s an example of how to find the smallest integer in a given range using their lib:

#include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for input and output
int          num_items;  // e.g., 7
int          *d_in;      // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_out;     // e.g., [-]
CustomMin    min_op;
int          init;       // e.g., INT_MAX
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduction
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init);

// d_out <-- [0]

Namely, there are two calls to the Reduce kernel.

The reasoning given is that the first call, made with d_temp_storage set to NULL, does no work and simply writes the required allocation size into temp_storage_bytes.

That seems like a weird design choice, right?

Idk. Has anyone used CUB and liked it? It’s a very strange-looking API, just glancing at the code.

I have never used CUB, but the “dry-run-to-query-workspace-requirements” approach seems vaguely familiar from LAPACK (LWORK = -1). So one might call this design style a bit old-fashioned. Thrust is more in tune with the latest C++ design styles.

I’ve used both, but not for any large production work. CUB is “CUDA UnBound”. If thrust works for you, it’s generally easier than using CUB, by almost any measure. And since Thrust uses CUB under the hood for a number of operations, saying “CUB is supposed to be faster than Thrust” is a questionable claim IMO.

As @njuffa points out, the dry run to query workspace thing is definitely not unique to CUB. Take a look at MAGMA, for example.

There are a great many things you can do with CUB that are essentially impossible with thrust. For example a warp-level or block-level prefix sum.
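For instance, a block-wide exclusive prefix sum is a few lines with cub::BlockScan; Thrust exposes nothing at this granularity. A sketch (compile-only, with nvcc; the host-side launch and allocation code is omitted):

```cuda
#include <cub/cub.cuh>

// Block-wide exclusive prefix sum over 128 threads, one item per thread.
__global__ void block_prefix_sum(int *d_data)
{
    // Specialize BlockScan for a block of 128 int-valued threads
    typedef cub::BlockScan<int, 128> BlockScan;

    // Shared memory scratch required by the collective
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_data[threadIdx.x];

    // Collectively compute the exclusive prefix sum across the block
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);

    d_data[threadIdx.x] = thread_data;
}
```

This is the "building block" flavor of CUB: you get a collective primitive you can drop into the middle of your own kernel, which thrust's device-wide algorithms can't do.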

I actually find CUB and thrust to be very similar. I wish CUB supported all the device-wide stuff that thrust does, but CUB is really designed to be a building-block approach, which allows you to more carefully tune what you are doing.

Again, if you can easily do what you want in thrust, there is probably little or no reason to use CUB.

Sounds like good advice. Thanks, guys.

I’m also a little surprised to hear that the dry run thing is a norm. It’s so strange; why not just write some other kind of diagnostic tool to help you determine launch bounds?

It’s not launch bounds. It’s the API method of figuring out how much scratch space is needed (clearly, since the returned value is passed directly to cudaMalloc). And the determination of how much scratch space is needed may depend on each and every argument that is passed to the function call (in the general case, when thinking about generalizing the method across an entire API). So what function prototype might give you all the arguments you need to make such a determination? Why, the function itself!

Then the programmer doesn’t have to remember another API, and it makes for easy cut-and-paste. It may be “old school”, but it seems quite sensible to me.

I have never seen a published rationale for this style of interface from the designers of LAPACK or MAGMA, but txbob’s explanation seems eminently plausible to me.

Note that LAPACK uses a different mechanism to report machine-specific configuration choices (e.g. crossover points, block sizes) in the form of the ILAENV function, which reports the relevant parameters for a whole slew of different LAPACK functions.

It’s the API method of figuring out how much scratch space is needed […]
And the determination of how much scratch space is needed may depend on each and every argument that is passed to the function call (in the general case, when thinking about generalizing the method across an entire API).

Ehr, sorry, no.

In many cases you can split your inputs into “parameters”, which customize the function’s behaviour, allow trade-offs between memory and complexity, etc., and “arguments”, which are the inputs and outputs of your (customized) function.

For example, if you want to sort a block of keys, and you need to allocate a temporary buffer, you care about the size of each key and the size of the block - but not about the actual keys.

So what function prototype might give you all the arguments you need to make such a determination? Why, the function itself!

Sure.
Sure.
Of course this approach is also ugly and confusing, requires passing extra parameters to a function that does not need them, and makes it harder to set up your buffers once and reuse them multiple times.