Regarding cublib reduce

I am not sure whether it is right to ask this here. I am using the cub(CUB: Main Page) block-wise reduce primitive to find the minimum of a given array. The max function works fine but the min function always returns zero. Please tell me where I am going wrong. The code is below:

__global__ void reduce_min(long *arr, long n, long *aggregate){
        typedef BlockReduce<int, 128> BlockReduce;
        __shared__ typename BlockReduce::TempStorage temp_storage;
        int thread_data[4];
        LoadDirectBlocked<128>(threadIdx.x, arr, thread_data);
        int res = BlockReduce(temp_storage).Reduce(thread_data, Min());
        if(threadIdx.x == 0){
                *aggregate = res;

arr is the input array n is the size of the array and aggregate stores the result. I am using 128 thread per block with 4 elements per thread. The kernel launch is below:

dim3 blk(n/128.0 + 1, 1, 1);
      reduce_min<<<blk,128>>>(arr, n, res);

I’m not familiar with cublib, but isn’t it something that is already implemented in Thrust?
You may want to have a look at it, as Thrust is already there in your CUDA installation and AFAIK, a reduction literally needs 1 or 2 lines of code. Then you are guaranteed the result will be correct.

Thanks for the reply, but I am using a few structures for the data handling, I also need some block wide reductions and I also want to perform some other operation within the same kernel after the reduce.

This is quite strange:

First of all, why are you using floating-point arithmetic? Second, when n is divisible by the block size, you are always launching 1 extra block (I don’t think you want to be launching “extra” blocks with this code.) Since all blocks in your grid will write to the same output variable, you are going to have different results written to the same aggregate location.

If you are using 4 items per thread, then presumably 1 block would handle 4*128 = 512 items, so presumably your n is 512. But if you put n=512 into this code, you will get 5 blocks launched!

That doesn’t make any sense to me.

You’ll need to fix the number of blocks you launch, and if you intend to launch multiple blocks, where each block is computing a separate result, you’ll need to make it so that different blocks write to different locations in the output array.

There are other problems with your code.

  • you are mixing int and long where you shouldn’t be. Pick one or the other and use it consistently.
  • the <128> template parameter for LoadDirectBlocked is simply incorrect. The only numerical template parameter that function takes is the number of items per thread (which is 4 in your case) and anyway the intent is that all the template parameters should be inferred anyway.

(this code would not have worked correctly with max function either, in fact it wouldn’t compile)
The following code seems to work correctly for me:

$ cat
#include <cub/cub.cuh>
#include <iostream>

__global__ void reduce_min(long *arr, long *aggregate){
        typedef cub::BlockReduce<long, 128> BlockReduce;
        __shared__ typename BlockReduce::TempStorage temp_storage;
        long thread_data[4];
        cub::LoadDirectBlocked(threadIdx.x, arr, thread_data);
        long res = BlockReduce(temp_storage).Reduce(thread_data, cub::Min());
        if(threadIdx.x == 0){
                *aggregate = res;

int main(){
  const int n = 512;
  dim3 blk(n/512, 1, 1);
  long *arr, *h_arr, *res;
  cudaMalloc(&arr, n*sizeof(long));
  cudaMalloc(&res, sizeof(long));
  h_arr = (long *)malloc(n*sizeof(long));
  for (int i = 0; i < n; i++) h_arr[i] = n-i;
  cudaMemcpy(arr, h_arr, n*sizeof(long), cudaMemcpyHostToDevice);
  reduce_min<<<blk,128>>>(arr, res);
  cudaMemcpy(h_arr, res, sizeof(long), cudaMemcpyDeviceToHost);
  std::cout << h_arr[0] << std::endl;

$ nvcc -o t313
$ cuda-memcheck ./t313
========= ERROR SUMMARY: 0 errors

CUDA 10.0, cub 1.8.0, Tesla V100 PCIE, CentOS 7

Thanks a lot for the reply, I am so sorry for the mixup. Actually, I was previously using the loadDirectStriped from the cub lib example, then I read about loadDirectBlocked but did not changed the function properly before posting. I am so sorry about that.

My code (with no other changes) certainly does not work correctly for n greater than 512. You seem to have not understood or not read some of my comments.

I think you are talking about this part

I have changed it according to your comments. Thanks a lot.

Yes, an additional modification would be needed for multiblock use, because my code (and your original code) only write to a single output location, independent of the block.

If you’ve modified the code to address that, I think it should work for any n that is whole number divisible by 512.