Segmented Reduction of small subarrays

I have a large array of floats, 1024 by 32768, and would like to perform operations on non-overlapping/tiled 64x64 subarrays, such as finding minimum values or summation. My test code uses an array of 1x32768 elements and extracts the sums of non-overlapping segments 64 elements long. I am using block reductions as shown by Robert Crovella in the forums:

const int ngroups = 511;
const int groupsz = 64;
const int halfblock = 32;  // choose as the minimum power of 2 greater than or equal to half of the groupsz

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[groupsz];
  sdata[threadIdx.x] = in[threadIdx.x + blockIdx.x*groupsz];
  __syncthreads();
  for (int i = halfblock; i>0; i>>=1){
    if ((threadIdx.x < i) && (threadIdx.x+i < groupsz))
      sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];
}

I was trying to extract sums from segments 64 elements long. The performance is very slow on a P620. I think the algorithm is running into memory bank conflicts. I am very new to CUDA, and the block reductions I have practiced with, including the tutorials and example code written by Harris, work very well and max out the bandwidth of my card. Most examples I see use grid-stride addressing, which as I understand it works well for large arrays. I have not yet seen examples that are optimized for extracting many small reductions out of a slightly larger array. Any help would be appreciated. Thanks!
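For context, here is a simplified sketch of the kind of launch I am testing with (the buffer names and host code are illustrative placeholders, not my exact test code):

// illustrative host-side launch for the kernel above
const int n = ngroups * groupsz;                 // 511 segments of 64 floats
float *d_in, *d_out;
cudaMalloc(&d_in,  n * sizeof(float));
cudaMalloc(&d_out, ngroups * sizeof(float));
// ... copy input data into d_in ...
blockReduction<float><<<ngroups, groupsz>>>(d_in, d_out);  // one block per 64-element segment
cudaDeviceSynchronize();
// d_out[i] now holds the sum of segment i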

A single large (dataset size >> grid) reduction can exploit the fact that most of the reduction work can be accomplished simply using serial single-thread reduction techniques. Only at the final “stages” do we need to use the shared-memory sweep process. Thus, the work performed by the code is dominated by the loading of global memory, and with a grid-stride loop this is easy to make fully coalesced, resulting in a reduction that closely approaches achievable memory bandwidth as its performance upper bound.
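For illustration, a rough sketch of that pattern (along the lines of the tutorial codes, not taken verbatim from them; the block size of 256 and the two-pass output scheme are just assumptions for the sketch):

// sketch: single large grid-stride reduction (sum)
// each thread first accumulates serially with fully coalesced loads;
// the shared-memory sweep only happens at the end, on per-thread partial sums
template <typename T>
__global__ void bigReduction(const T * __restrict__ in, T * __restrict__ out, size_t n){
  __shared__ T sdata[256];                        // assumes blockDim.x == 256
  T val = 0;
  for (size_t i = threadIdx.x + (size_t)blockIdx.x*blockDim.x; i < n; i += (size_t)gridDim.x*blockDim.x)
    val += in[i];                                 // serial, coalesced accumulation
  sdata[threadIdx.x] = val;
  __syncthreads();
  for (int i = blockDim.x>>1; i > 0; i >>= 1){    // final shared-memory sweep
    if (threadIdx.x < i) sdata[threadIdx.x] += sdata[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];   // per-block partial results, reduced in a second pass
}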

In the case of a large number of small reductions, we are faced with the fact that we are doing a full block-level sweep for every 64 elements (in this case/example). The work the code is doing is no longer dominated by coalesced global loads; the parallel sweep now makes up a notable fraction of the work.

So it's going to be slower.

Rest assured that the code you have posted will still load global memory in a coalesced fashion, and will not have shared-memory bank conflicts.

This reduction technique, however, can in some cases be improved by using warp shuffle methods. You can follow the instructions here, or this training material gives an additional treatment of it.

For the example you have here, I would expect an improvement. In the large single reduction case it makes almost no difference, because the code activity is dominated by coalesced global loads, and only a small fraction of the work is the parallel sweep reduction. For your case, the work is split much more noticeably toward the parallel sweep reduction, so the benefit of applying warp shuffle there should be more noticeable.

Robert, thanks very much. I will look at the links and try to apply them.

Perhaps something like this:

const int groupsz = 64;

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[2];  // specific for groupsz 64 case
  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // warp-shuffle reduction
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads(); // put warp results in shared mem

  if (!threadIdx.x) out[blockIdx.x] = sdata[0] + sdata[1]; // specific for groupsz 64 case
}

That final line of code (as well as the shared memory size, probably) would need to be modified for the more general case of up to 32 warps. Basically, before the last line of code, you would insert an additional warp-level shuffle reduction, operating on only the first warp. The methodology is covered in the previous links.
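In sketch form, the more general version might look something like this (untested sketch; the kernel name and the groupsz value of 256 are just for illustration, and T is assumed to be a numeric type):

const int groupsz = 256;   // any multiple of 32 up to 1024

// generalized version: block size (threads per block) == groupsz, up to 32 warps per block
template <typename T>
__global__ void blockReductionGeneral(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[32];                         // one slot per possible warp
  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // first warp-shuffle reduction: each warp reduces its own 32 values
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads();                                // per-warp results now in shared mem
  // second warp-shuffle reduction: the first warp reduces the per-warp results
  if (threadIdx.x < 32){
    val = (threadIdx.x < (groupsz+31)/32) ? sdata[threadIdx.x] : 0;
    for (int offset = 16; offset > 0; offset >>= 1)
      val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    if (!threadIdx.x) out[blockIdx.x] = val;
  }
}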


According to my testing on a Quadro K610M, for the dimensions given (ngroups = 511, groupsz = 64), the shared-memory sweep reduction runs in about 60us, whereas the warp-shuffle reduction runs in about 33us, for T = float.


WOW, thanks, I will try it tonight!

Robert, it just occurred to me that since the individual subarrays are non-overlapping and all the same size, can't I do a shuffle on an entire 32768-element line 5 times in succession and collect the values afterwards? I'll try that too.

I don’t know what that means. But sure give it a try.

Breaking up the 32768-element line into 64-element segments is arbitrary, as long as I collect the proper final values later? I was thinking of applying five shuffles in a row to a line without it being granular at the segment level. Just let the card go as parallel as possible?

Sorry, I'm not able to grasp your intent from the English description. I find code to be more descriptive, especially since we are talking about fewer than 20 lines of code. I suspect you don't understand what warp shuffle can or can't do, but really that's just a guess.

A warp shuffle works on (at most) 32 elements, and it does so at the warp level. You could "break up" your 32768 elements across 1024 warps, and have each of those 1024 warps do a warp shuffle 5 times in a row. That description is more or less exactly what the code I provided does. When it comes to the very end, and you want to "combine" results across 64 input elements, it's convenient if the two warps that processed those 64 elements belong to the same threadblock.

If you have another idea that works, great! Even better if it is faster!

Hi Robert, I was saying exactly what you just said only my idea was in gibberish. I will test the code this morning. Thank you again

I tried it and it appears to be blindingly fast. I also looked at the Video training documents and the webpages, thanks for the links, you broke a conceptual blockade for me.

Hi, can the same code be used to find the minimum value? I am using two ternary expressions to find a minimum and a temporary thread variable to hold the shifted thread value. It's not working, and I think I am misapplying the concept of shuffle and thread-to-thread access by using the temp_val variable.

const int groupsz = 64;

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockReduction(const T * __restrict__ in, T * __restrict__ out){
  float temp_val;
  __shared__ T sdata[2];  // specific for groupsz 64 case

  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // warp-shuffle reduction
  for (int offset = 16; offset > 0; offset >>= 1) {
    temp_val = __shfl_down_sync(0xFFFFFFFF, val, offset);
    val = val < temp_val ? val : temp_val;
  }
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads(); // put warp results in shared mem

  if (!threadIdx.x) out[blockIdx.x] = sdata[1] < sdata[0] ? sdata[1] : sdata[0];  // specific for groupsz 64 case
}

I don’t seem to have any trouble with the code you have shown for reducing float quantities.

You might want to provide a complete example. Your error may be somewhere else.

Also, you can do as you wish, of course, but this:

 float temp_val;

pretty much defeats the purpose of templating. And if you are reducing something other than float values, that may be the problem.

If it were me, I would do:

T temp_val;

(I also don’t think you really need a temp_val, but that’s getting kind of subjective.)

Thanks, I messed up the GPU-to-host transfer before, but as long as I use temp_val it works great. When I directly use the shuffled reference it doesn't; I'm still troubleshooting why that is. This is a terrific advance. THANKS VERY MUCH FOR YOUR HELP!

I wasn’t very clear when I said you don’t really need temp_val. What I meant was you could do something like:

  for (int offset = 16; offset > 0; offset >>= 1)  
       val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset));

But it's not necessarily “better” than what you have.
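For reference, the whole kernel with that change folded in might look like this (untested sketch, still specific to the groupsz = 64 case; min here assumes a type with a device overload, such as float or int):

const int groupsz = 64;

// for this kernel we assume that we launch with block size (threads per block) == groupsz
template <typename T>
__global__ void blockMinReduction(const T * __restrict__ in, T * __restrict__ out){

  __shared__ T sdata[2];  // specific for groupsz 64 case
  T val = in[threadIdx.x + blockIdx.x*groupsz];
  // warp-shuffle min reduction
  for (int offset = 16; offset > 0; offset >>= 1)
    val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
  if (!(threadIdx.x & 31)) sdata[threadIdx.x >> 5] = val;
  __syncthreads(); // put warp results in shared mem

  if (!threadIdx.x) out[blockIdx.x] = min(sdata[0], sdata[1]); // specific for groupsz 64 case
}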

I like yours better in terms of not needing a dummy variable. Have a great weekend.
