compute segmented sum using CUDA

Hi,

I would like to compute the sum of an array using warp aggregated atomics. I have a dataset with 5000 entries. For each entry, there are 441 elements with double data type. This 1-D data array is passed on to GPU memory with 5000*441 elements.

From this 1-D array, I need to compute segmented sum of each 441 element.
Could anyone suggest a best way to implement this functionality using CUDA?

I tried atomicAdd() but takes more time.
If I use Thrust library reduce function, I am not clear how to implement this segmented sum?

Example: sum_result[0]=sum(in_array(0 to 440))
sum_result[1]=sum(in_array(441 to 880))

     sum_result[4999]=sum(in_array(4558 to 4999))

If anyone can give some solution it will be greatly helpful.

Thanks

The warp aggregated atomics case has a number of special cases to consider:

  • each warp must account for the fact that it may straddle a group boundary, and therefore must possibly handle 2 warp level reductions and 2 atomic operations per warp
  • in addition, the corner case around end-of-grid handling, adds complexity to the conditionals above.
  • the warp level reduction must potentially have non-power-of-2 sizes and start on arbitrary boundaries.

As a further objection, use of atomics (even warp-aggregated atomics) for operations that can reasonably be done with ordinary reductions probably isn’t necessary – ordinary reductions are probably preferred (faster).

Therefore a pure warp aggregated reduction was more than I wanted to work on for a casual coding exercise. A modern treatment/example of warp aggregated atomics is covered here:

http://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf

e.g. on slide 48. However this case assumes we are only counting active threads and reserving space in queue. It is not actually doing a warp-level reduction also (unless you consider the use of popcnt to be a reduction)

What follows then is a treatment of 2 approaches:

  • thrust using reduce_by_key (which is a canonical segmented reduction algorithm)
  • a simplistic block level reduction

For your particular case, where the group size is reasonably large (441) and less than the max limit on threads per block (1024), I think it makes a lot of sense, at least performance wise, to assign each group to a block. With this method we can dispense with atomics entirely.

Here’s a worked example of both approaches:

$ cat t1345.cu
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>

const int ngroups = 5000;
const int groupsz = 441;
const int halfblock = 256;  // choose as the minimum power of 2 greater than or equal to half of the groupsz

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[groupsz];
  sdata[threadIdx.x] = in[threadIdx.x + blockIdx.x*groupsz];
  __syncthreads();
  for (int i = halfblock; i>0; i>>=1){
    if ((threadIdx.x < i) && (threadIdx.x+i < groupsz))
      sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];
}

using namespace thrust::placeholders;
int main(){

  thrust::device_vector<double> d(ngroups*groupsz, 1);
  thrust::device_vector<double> r(ngroups);
  unsigned long long dt = dtime_usec(0);
  thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/groupsz), thrust::make_transform_iterator(thrust::counting_iterator<int>(ngroups*groupsz), _1/groupsz), d.begin(), thrust::make_discard_iterator(), r.begin());
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  thrust::copy_n(r.begin(), 5, std::ostream_iterator<double>(std::cout, ","));
  std::cout << std::endl << "thrust time: " << dt/(float)USECPSEC << " seconds" << std::endl;

  thrust::device_vector<double> r2(ngroups);
  dt = dtime_usec(0);
  blockReduction<<<ngroups, groupsz>>>(thrust::raw_pointer_cast(d.data()), thrust::raw_pointer_cast(r2.data()));
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  thrust::copy_n(r2.begin(), 5, std::ostream_iterator<double>(std::cout, ","));
  std::cout << std::endl << "kernel time: " << dt/(float)USECPSEC << " seconds" << std::endl;

}

$ nvcc -arch=sm_35 -o t1345 t1345.cu
$ ./t1345
441,441,441,441,441,
thrust time: 0.002895 seconds
441,441,441,441,441,
kernel time: 0.000354 seconds
$

We see that for this particular example, the simplistic kernel approach is almost 10x faster than thrust. We shouldn’t be too hard on thrust here: I’ve saddled it with an integer division that I don’t have to do with my CUDA kernel, but the thrust case is more flexible, and maintainable.

Although I don’t show it here, running this code with cuda-memcheck indicates no runtime errors. CUDA 8, CentOS 7, Tesla K20x

Thanks txbob for your valuable inputs. I tried to run the block reduction approach for sample dataset and it works fine. To begin with I tested for ngroups = 1 and groupsz = 441.

To measure the kernel run time, I use cudaEventRecord() API as below. Using this approach, the
run time is measured as 0.207648 ms.
[b]
But for the example code above, when i ran on Tesla K20C card I got the results below:

441,441,441,441,441,
thrust time: 0.000807 seconds
441,441,441,441,441,
kernel time: 0.000396 seconds[/b]

I would like to know the reason behind this sudden increase in run time. Any suggestions shall be useful.

Thanks.

const int ngroups = 1;
const int groupsz = 441;
const int halfblock = 256; // choose as the minimum power of 2 greater than or equal to half of the groupsz

// for this kernel we assume that we launch with block size (threads per block) == groupsz
global void blockReduction(double *out, double *in){

shared double sdata[groupsz];
sdata[threadIdx.x] = in[threadIdx.x + blockIdx.x*groupsz];
__syncthreads();
for (int i = halfblock; i>0; i>>=1){
if ((threadIdx.x < i) && (threadIdx.x+i < groupsz))
sdata[threadIdx.x] += sdata[threadIdx.x+i];
__syncthreads();}
if (!threadIdx.x) out[blockIdx.x] = sdata[0];
}

int main()
{
FILE *A_f = fopen("/home/…/file.txt", “r”);

int i;
double h_A = (double)malloc(groupsz* sizeof(double));
double h_sum = (double)malloc(ngroups* sizeof(double));

if (A_f == NULL) { return 1; }
for (i = 0;i<groupsz;++i) {
fscanf(A_f, “%lf”, &h_A[i]);
}

for (i = 0;i<groupsz;++i) {
printf("\n h_A[%d]=%0.15f \n",i,h_A[i]);
}

for (i = 0;i<ngroups;++i) {
h_sum[i]=0.0f;
}

double d_A, d_sum;
cudaMalloc(&d_A, groupsz
sizeof(double));
cudaMalloc(&d_sum, ngroups
sizeof(double));

cudaMemcpy(d_A, h_A, groupsz* sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_sum, h_sum, ngroups* sizeof(double), cudaMemcpyHostToDevice);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float elapsedTime1=0.0f;

// GPU sum

cudaEventRecord(start, 0);
blockReduction<<<ngroups, groupsz>>> (d_sum,d_A);
cudaDeviceSynchronize();

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime1 , start, stop);
printf("\n\n T1: Measured Time for sum of a sub-image = %f ms\n\n", elapsedTime1);

cudaMemcpy(h_sum,d_sum, ngroups* sizeof(double), cudaMemcpyDeviceToHost);
for (i = 0;i<ngroups;++i) {
printf("\n h_sum[%d]=%0.15f \n",i,h_sum[i]);
}

return 0;
}

You reported a run-time of 0.207648 ms for your code which executes a single block. If we convert this to seconds it is 0.000207648 seconds.

Then you reported a run-time of 0.000396 seconds for the code that runs 5000 blocks.

This is about 2x longer than your single block code. That seems quite plausible to me. I’m not sure why you are asking about the increase in time. Wouldn’t you expect that the code that runs 5000 blocks takes longer than the same code running just 1 block?

Thanks txbob.I need help for one more case where

const int ngroups = 5000;
const int groupsz = 3721;

In this case the kernel launch configuration and its definition will have to be changed because (groupsz = 3721) > 1024 and will not fit within a single thread block. I am not clear how to proceed for this case.
Any suggestions shall be greatly helpful.

Thanks.

use the thrust method

Another possibility would be to modify my previous example kernel method to use a “block-striding” kernel to handle the case where groupsz > 1024. The block-striding kernel method is similar conceptually to a grid-stride kernel method:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

except we are doing it individually on each block. Here is a worked example:

$ cat t1345.cu
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>

const size_t ngroups = 5000;
const size_t groupsz = 3721;
const int blocksz = 1024;   // set equal to min(1024, groupsz)
const int halfblock = 512;  // choose as the minimum power of 2 greater than or equal to half of the blocksz

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__host__ __device__
T my_reduction_op(T &d1, T &d2){

  return d1+d2;
}

template <typename T>
__host__ __device__
T my_reduction_init(){
  return 0;
}

// for this kernel we assume that the above reduction functions are properly specified
// and that blocksz and halfblock are set correctly per the requirements above
template <typename T>
__global__ void blockReduction(T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[blocksz];
  size_t idx = threadIdx.x+groupsz*blockIdx.x;
  sdata[threadIdx.x] = my_reduction_init<T>();
  while (idx < (groupsz*(blockIdx.x+1))){  // block-striding
    sdata[threadIdx.x] = my_reduction_op(sdata[threadIdx.x], in[idx]);
    idx += blockDim.x;}
  __syncthreads();
  for (int i = halfblock; i>0; i>>=1){
    if ((threadIdx.x < i) && (threadIdx.x+i < groupsz))
      sdata[threadIdx.x] = my_reduction_op(sdata[threadIdx.x], sdata[threadIdx.x+i]);
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];
}

using namespace thrust::placeholders;
int main(){

  thrust::device_vector<double> d(ngroups*groupsz, 1);
  thrust::device_vector<double> r(ngroups);
  unsigned long long dt = dtime_usec(0);
  thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<size_t>(0), _1/groupsz), thrust::make_transform_iterator(thrust::counting_iterator<size_t>(ngroups*groupsz), _1/groupsz), d.begin(), thrust::make_discard_iterator(), r.begin());
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  thrust::copy_n(r.begin(), 5, std::ostream_iterator<double>(std::cout, ","));
  std::cout << std::endl << "thrust time: " << dt/(float)USECPSEC << " seconds" << std::endl;

  thrust::device_vector<double> r2(ngroups);
  dt = dtime_usec(0);
  blockReduction<<<ngroups, blocksz>>>(thrust::raw_pointer_cast(d.data()), thrust::raw_pointer_cast(r2.data()));
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  thrust::copy_n(r2.begin(), 5, std::ostream_iterator<double>(std::cout, ","));
  std::cout << std::endl << "kernel time: " << dt/(float)USECPSEC << " seconds" << std::endl;

}

$ nvcc -arch=sm_35 -lineinfo -o t1345 t1345.cu
$ cuda-memcheck ./t1345
========= CUDA-MEMCHECK
3721,3721,3721,3721,3721,
thrust time: 0.296156 seconds
3721,3721,3721,3721,3721,
kernel time: 0.096133 seconds
========= ERROR SUMMARY: 0 errors
$

Note that the above method is a “generalization” of the previous method so should work (give correct results) for any groupsz of 1 or greater (assuming blocksz and halfblock are chosen correctly), for an arbitrary number of groups, 1 or greater. For the cases we have been talking about (e.g. ngroups = 5000, groupsz = 441 or 3721) it should be fairly efficient. There will be edge cases (e.g. ngroups = 1 or a small number, or groupsz is small) where this won’t be an efficient method. I wouldn’t be surprised if thrust was potentially a lot faster for some of those cases.

I’ve also further “generalized” the above kernel approach to allow for an arbitrary user-defined binary reduction operation, specified by my_reduction_op and my_reduction_init functions. They are presently set up for a sum-reduction as we have been discussing here.

A few comments about timing:

  • cuda-memcheck in general slows things down. So don’t compare timing numbers obtained using cuda-memcheck against those obtained without it.
  • The timing for this routine will be longer, one reason is that it is doing more work (50003721 is larger than 5000441)

A modified version of the previous code which allows for variable group sizes is here:

https://devtalk.nvidia.com/default/topic/1028130/cuda-programming-and-performance/best-way-to-find-many-minimums/post/5229824/#5229824

https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html

https://moderngpu.github.io/segreduce.html

Thank you txbob and BulatZiganshin.