Performing N reductions at the same time - N reduction kernels or one kernel for all N reductions?

At each step of my simulation, I have to reduce N arrays, where N is on the order of 10–30 and each array is large (256^3 elements). I'm using doubles, so shared memory space might be a constraint, but it's typically manageable.

My question is: should I reduce each of the N arrays in its own kernel (the kernels could then run concurrently), or write one reduction kernel that loads blocks of all N arrays and performs the current reduction step for all N arrays at once?

You can probably use a zip iterator and then define your own addition operator using a device-side lambda and/or a user-defined callable object.

I’m not concerned about the implementation either way; this is just a question about which is “best practice” with respect to speed. The C++ methods you reference are beyond the scope of 1) my knowledge (I mostly use the C-style capabilities) and 2) what’s necessary, I think.

I also realized that my method won’t work as well…


I’d say, just launch a kernel per array then. It’s simple and straightforward. I’m curious whether there’s a more efficient way of doing this, but I don’t think you’ll know until you profile and benchmark.

Ok, thanks for the input. I do like it for being simpler and more general, and then I can have multiple kernels running simultaneously. I think it should be faster for that reason, since the memory transfers from device memory to the SMX would be identical in either case. But maybe there’s something to giving each thread of the reduction kernel more work to do…
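For the record, the kernel-per-array route with overlap looks roughly like this. This is only a sketch of one way to do it, not anyone's actual code: the function name, the `d_in`/`d_out` pointer arrays, and the choice of `cub::DeviceReduce::Sum` are my assumptions; error checks and stream/scratch cleanup are omitted for brevity.

```cuda
#include <cub/cub.cuh>

// Sketch: N independent sum-reductions, one per stream, so the GPU may
// overlap them. Assumes d_in[i] points to the i-th device array of
// num_items doubles and d_out[i] receives its sum (names hypothetical).
void reduce_all(double* d_in[], double* d_out[], int num_items, int N)
{
  for (int i = 0; i < N; ++i) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // CUB reports its scratch size on the first call (d_temp == nullptr),
    // then performs the reduction on the second.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in[i], d_out[i],
                           num_items, stream);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in[i], d_out[i],
                           num_items, stream);
  }
  cudaDeviceSynchronize();
}
```

Whether the reductions actually overlap depends on occupancy; if each 256^3 reduction already saturates the device, the streams mostly serialize and you're back to the simple case.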

Well, optimizing software is nonsense when you can just buy faster hardware :P

My general rule of thumb is: write something stupid-simple first. See if it’s slow, and if it is, find the bottlenecks and design your optimizations around those.

I feel like if you tried to cram a bunch of reductions into one kernel, you’d be slaughtered by memory throughput.

One thing you might be able to do is reduce over a zip iterator range and have the output value be a tuple. Let me see if I can get a sample working for ya.

Okay, got a working sample where each range is individually summed.

// nvcc --expt-extended-lambda -O3 -gencode arch=compute_61,code=sm_61 -o zip-reduce zip-reduce.cu
#include &lt;iostream&gt;
#include &lt;thrust/reduce.h&gt;
#include &lt;thrust/iterator/zip_iterator.h&gt;
#include &lt;thrust/iterator/counting_iterator.h&gt;
#include &lt;thrust/tuple.h&gt;
#include &lt;thrust/execution_policy.h&gt;

namespace T = thrust;

int main(void)
{
  auto const num_vals = size_t{3000};

  // Two counting iterators stand in here for the real data ranges.
  auto const zip_begin = T::make_zip_iterator(
    T::make_tuple(
      T::make_counting_iterator(0),
      T::make_counting_iterator(0)));

  // Name the value type explicitly; decltype(*zip_begin) is a proxy
  // reference type, not a plain tuple.
  using tuple_t = T::tuple&lt;int, int&gt;;

  auto const zip_end = zip_begin + num_vals;

  auto const tuple_sum = T::reduce(
    zip_begin, zip_end,
    T::make_tuple(0, 0),
    [] __host__ __device__ (tuple_t const&amp; a, tuple_t const&amp; b) -&gt; tuple_t {
      return T::make_tuple(
        T::get&lt;0&gt;(a) + T::get&lt;0&gt;(b),
        T::get&lt;1&gt;(a) + T::get&lt;1&gt;(b));
    });

  std::cout &lt;&lt; T::get&lt;0&gt;(tuple_sum) &lt;&lt; ", " &lt;&lt; T::get&lt;1&gt;(tuple_sum) &lt;&lt; "\n";

  return 0;
}
Alright, I tried my hardest to get a variadic solution going, but unfortunately Thrust was not coded with variadics in mind, so its tuple type is padded with Thrust null_types, which is incredibly problematic for implementing decent variadic tuple summation.


Such is life. But you can manually hard-code solutions that are still pretty decent. It just sucks that they’re not fully genericized.