Segmented Reduction of small subarrays

I have a large array of floats, 1024 by 32768, and would like to perform operations on non-overlapping/tiled 64x64 subarrays, such as finding minimum values or summation. My test code uses an array of 1x32768 elements and extracts the sums of non-overlapping segments 64 elements long. I am using block reductions as shown by Robert Crovella in the forums:

const int ngroups = 511;
const int groupsz = 64;
const int halfblock = 32;  // choose as the minimum power of 2 greater than or equal to half of the groupsz

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[groupsz];
  sdata[threadIdx.x] = in[threadIdx.x + blockIdx.x*groupsz];
  __syncthreads();
  for (int i = halfblock; i>0; i>>=1){
    if ((threadIdx.x < i) && (threadIdx.x+i < groupsz))
      sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];
}

I was trying to extract sums from segments 64 elements long. The performance is very slow on a P620. I think the algorithm is running into memory bank conflicts. I am very new to CUDA, and the block reductions I have practiced with, including the tutorials and example code written by Harris, work very well and max out the bandwidth of my card. Most examples I see use grid-stride addressing, which as I understand it works well for large arrays. I have not yet seen examples that are optimized for extracting many small reductions out of a slightly larger array. Any help would be appreciated. Thanks!
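For context, here is a simplified sketch of the kind of launch I am testing with (the buffer names and host code are illustrative placeholders, not my exact test code):

// illustrative host-side launch for the kernel above
const int n = ngroups * groupsz;                 // 511 segments of 64 floats
float *d_in, *d_out;
cudaMalloc(&d_in,  n * sizeof(float));
cudaMalloc(&d_out, ngroups * sizeof(float));
// ... copy input data into d_in ...
blockReduction<float><<<ngroups, groupsz>>>(d_in, d_out);  // one block per 64-element segment
cudaDeviceSynchronize();
// d_out[i] now holds the sum of segment i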

A single large (dataset size >> grid) reduction can exploit the fact that most of the reduction work can be accomplished simply using serial single-thread reduction techniques. Only at the final “stages” do we need to use the shared-memory sweep process. Thus, the work performed by the code is dominated by the loading of global memory, and with a grid-stride loop this is easy to make fully coalesced, resulting in a reduction that closely approaches achievable memory bandwidth as its performance upper bound.
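For illustration, a rough sketch of that pattern (along the lines of the tutorial codes, not taken verbatim from them; the block size of 256 and the two-pass output scheme are just assumptions for the sketch):

// sketch: single large grid-stride reduction (sum)
// each thread first accumulates serially with fully coalesced loads;
// the shared-memory sweep only happens at the end, on per-thread partial sums
template <typename T>
__global__ void bigReduction(const T * __restrict__ in, T * __restrict__ out, size_t n){
  __shared__ T sdata[256];                        // assumes blockDim.x == 256
  T val = 0;
  for (size_t i = threadIdx.x + (size_t)blockIdx.x*blockDim.x; i < n; i += (size_t)gridDim.x*blockDim.x)
    val += in[i];                                 // serial, coalesced accumulation
  sdata[threadIdx.x] = val;
  __syncthreads();
  for (int i = blockDim.x>>1; i > 0; i >>= 1){    // final shared-memory sweep
    if (threadIdx.x < i) sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];   // per-block partial results, reduced in a second pass
}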

In the case of a large number of small reductions, we are faced with the fact that we are doing a full block-level sweep for every 64 elements (in this case/example). The work the code is doing is no longer dominated by coalesced global loads; the parallel sweep now makes up a notable fraction of the work.

So it's going to be slower.

Rest assured that the code you have posted will still load global memory in a coalesced fashion, and will not have shared-memory bank conflicts.

This reduction technique, however, can in some cases be improved by using warp shuffle methods. You can follow the instructions here, or this training material gives an additional treatment of it.

For the example you have here, I would expect an improvement. In the large single reduction case it makes almost no difference, because the code activity is dominated by coalesced global loads, and only a small fraction of the work is the parallel sweep reduction. For your case, the work is split much more noticeably toward the parallel sweep reduction, so the benefit of applying warp shuffle there should be more noticeable.

Robert, thanks very much. I will look at the links and try to apply them.

Perhaps something like this:

const int groupsz = 64;

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[2];  // specific for groupsz 64 case
  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // warp-shuffle reduction
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads(); // put warp results in shared mem

  if (!threadIdx.x) out[blockIdx.x] = sdata[0] + sdata[1]; // specific for groupsz 64 case
}

That final line of code (as well as the shared memory size, probably) would need to be modified for the more general case of up to 32 warps. Basically, before the last line of code, you would insert an additional warp-level shuffle reduction, operating on only the first warp. The methodology is covered in the previous links.
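In sketch form, the more general version might look something like this (untested sketch; the kernel name and the groupsz value of 256 are just for illustration, and T is assumed to be a numeric type):

const int groupsz = 256;   // any multiple of 32 up to 1024

// generalized version: block size (threads per block) == groupsz, up to 32 warps per block
template <typename T>
__global__ void blockReductionGeneral(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[32];                         // one slot per possible warp
  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // first warp-shuffle reduction: each warp reduces its own 32 values
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads();                                // per-warp results now in shared mem
  // second warp-shuffle reduction: the first warp reduces the per-warp results
  if (threadIdx.x < 32){
    val = (threadIdx.x < (groupsz+31)/32) ? sdata[threadIdx.x] : 0;
    for (int offset = 16; offset > 0; offset >>= 1)
      val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    if (!threadIdx.x) out[blockIdx.x] = val;
  }
}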


According to my testing on a Quadro K610M, for the dimensions given (ngroups = 511, groupsz = 64), the shared-memory sweep reduction runs in about 60us, whereas the warp-shuffle reduction runs in about 33us, for T = float.


WOW, thanks, I will try it tonight!

Robert, it just occurred to me that since the individual subarrays are non-overlapping and all the same size, can't I do a shuffle on an entire 32768-element line 5 times in succession and collect the values afterwards? I'll try that too.

I don’t know what that means. But sure give it a try.

Breaking up the 32768-element line into 64-element segments is arbitrary, as long as I collect the proper final values later? I was thinking of applying five shuffles in a row to a line without it being granular at the segment level. Just let the card go as parallel as possible?

Sorry, I'm not able to grasp your intent from the English description. I find code to be more descriptive, especially since we are talking about fewer than 20 lines of code. I suspect you don't understand what warp shuffle can or can't do, but really that's just a guess.

A warp shuffle works on (at most) 32 elements, and it does so at the warp level. You could "break up" your 32768 elements across 1024 warps, and have each of those 1024 warps do a warp shuffle 5 times in a row. That description is more or less exactly what the code I provided does. When it comes to the very end, and you want to "combine" results across 64 input elements, it's convenient if the two warps that processed those 64 elements belong to the same threadblock.

If you have another idea that works, great! Even better if it is faster!

Hi Robert, I was saying exactly what you just said only my idea was in gibberish. I will test the code this morning. Thank you again

I tried it and it appears to be blindingly fast. I also looked at the Video training documents and the webpages, thanks for the links, you broke a conceptual blockade for me.

Hi, can the same code be used to find the minimum value? I am using two ternary expressions to find a minimum and a temporary thread variable to hold the shifted thread value. It's not working, and I think I am misapplying the concept of shuffle and thread-to-thread access by using the temp_val variable.

const int groupsz = 64;

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){
  float temp_val;
  __shared__ T sdata[2];  // specific for groupsz 64 case

  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // warp-shuffle reduction
  for (int offset = 16; offset > 0; offset >>= 1) {
    temp_val = __shfl_down_sync(0xFFFFFFFF, val, offset);
    val = val < temp_val ? val : temp_val;
  }
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads(); // put warp results in shared mem

  if (!threadIdx.x) out[blockIdx.x] = sdata[1] < sdata[0] ? sdata[1] : sdata[0];  // specific for groupsz 64 case
}

I don’t seem to have any trouble with the code you have shown for reducing float quantities.

You might want to provide a complete example. Your error may be somewhere else.

Also, you can do as you wish, of course, but this:

 float temp_val;

pretty much defeats the purpose of templating. And if you are reducing something other than float values, that may be the problem.

If it were me, I would do:

T temp_val;

(I also don’t think you really need a temp_val, but that’s getting kind of subjective.)

Thanks, I messed up the GPU-to-host transfer before, but as long as I use temp_val it works great. When I directly use the shuffled reference it doesn't; I'm still troubleshooting why that is. This is a terrific advance. THANKS VERY MUCH FOR YOUR HELP!

I wasn’t very clear when I said you don’t really need temp_val. What I meant was you could do something like:

  for (int offset = 16; offset > 0; offset >>= 1)  
       val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset));

But it's not necessarily “better” than what you have.
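For reference, the whole kernel with that change folded in might look like this (untested sketch, still specific to the groupsz = 64 case; min here assumes a type with a device overload, such as float or int):

const int groupsz = 64;

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockMinReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[2];  // specific for groupsz 64 case
  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // warp-shuffle min reduction
  for (int offset = 16; offset > 0; offset >>= 1)
    val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads(); // put warp results in shared mem

  if (!threadIdx.x) out[blockIdx.x] = min(sdata[0], sdata[1]); // specific for groupsz 64 case
}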

I like yours better in terms of not needing a dummy variable. Have a great weekend.
