Best way to find many minimums

I have simplified my kernel so that the data to be minimised are
sequential in a single global int array. The goal is to find the
minimum of each chunk and write it to a global output array.
The chunks are of different sizes (although packed small to large).

The kernel reads a block-sized span of ints into shared memory and uses the
shared memory to find the min of all the chunks in the current block by reduction.
This is complicated by the chunk boundaries, but in most cases
a block contains only one chunk. Finally, one or more min int values
are written to global memory using atomicMin. The value returned by atomicMin
is not used and the block terminates. Many global outputs are updated just once,
but typically 20+ blocks contribute to one global output, so that output
receives 20+ atomicMin updates.
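For reference, the pattern described above might look something like this in its simplest form. This is a sketch only: it assumes (purely for illustration) that each block maps to exactly one chunk via a `chunkId[]` lookup table, and it omits the chunk-boundary handling within a block:

```cuda
#include <climits>

#define BLOCK 128

// Sketch: each block reduces its BLOCK inputs to a single minimum in shared
// memory, then thread 0 publishes the result with atomicMin.  chunkId[] maps
// each block to its output slot (an illustrative assumption); out[] must be
// pre-filled with INT_MAX before launch.
__global__ void chunkMin(const int *in, int *out, const int *chunkId, int n)
{
    __shared__ int s[BLOCK];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (idx < n) ? in[idx] : INT_MAX;      // pad with the identity
    __syncthreads();
    for (int i = BLOCK / 2; i > 0; i >>= 1) {            // tree reduction
        if (threadIdx.x < i)
            s[threadIdx.x] = min(s[threadIdx.x], s[threadIdx.x + i]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicMin(&out[chunkId[blockIdx.x]], s[0]);      // return value unused
}
```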

I am using a block size of 128 threads, CUDA 7.0 on a GeForce GTX 745,
although I would of course like it to work on any recent GPU.

Since there is no data reuse on the input, why does nvvp claim an L2 cache hit rate of 50%?

Would it be better to use zero-copy memory rather than an explicit pinned cudaMemcpy?
(The input array varies from 0 to 16 MBytes, 0 to 3000 chunks.
nvvp claims typical PCIe transfer rates of 12.7 GB/s in and 2.9 GB/s back to the host.)

Is shared memory reduction the best way to go?
(I have seen mention of __threadfence() but am not sure how it synchronises
across blocks.)

Also I have seen mention of doing reductions with Kepler shuffle rather
than with shared memory, but:

  1. it looks horrendous
  2. there is no shortage of shared memory (I use one int per thread)
  3. the bottleneck seems to be reading global memory and/or the atomicMin writes to global

nvvp complains of poor use of shared memory (yet a typical kernel
has a “shared efficiency” of 87.8%).

nvvp complains of low “Global Load Efficiency” (a typical kernel has 40%)

I ignore the value returned by atomicMin. Will the compiler ensure the hardware
does not try to return the unwanted original value? Would that make the
atomic operation more efficient?

I am not sure how to use either nvvp or nvprof to understand or improve
the global memory reads or the atomic writes.

Any help or guidance you can give would be most welcome

A recent discussion on performing a set of parallel reductions (a segmented reduction) is here:

One of the things demonstrated there was the use of thrust::reduce_by_key, which should be adaptable to your case. The difference between min-of-segment and sum-of-segment is not difficult to include. The straight CUDA code version expects every segment to be the same size; however, it’s not difficult to modify that code to allow for variable-sized segments.

Depending on your actual access pattern, the L2 hit rate may be affected by data accessed by one warp pulling data into the L2 so it is present for another warp. You don’t need reuse of a specific item to see this, just “reuse” of the same cache line. Furthermore, a previous cudaMemcpy operation may populate the L2, resulting in hits even though your kernel has not previously touched the data.

If you’re talking about using zero-copy memory for your data, and not actually transferring it to the GPU, I doubt it. Why not try it?

Shared memory is generally fast and is canonically recommended. The threadfence technique reduces the number of kernel launches in a typical ping-pong reduction to just one. Study Mark Harris’s canonical CUDA parallel reduction, then study the threadfenceReduction sample code. Both still use shared memory for the sweep within the block, to produce the block-level result. The difference is that the canonical parallel reduction writes out a block-level result (requiring another launch to combine them), whereas the threadfence version accumulates the block-level results after all blocks have drained, in whichever block it determines is last.
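Condensed, the single-kernel threadfence pattern looks roughly like this (adapted here from sum to min; names are illustrative, not the sample’s own). Note that __threadfence() does not synchronise blocks; it only orders memory visibility, and the atomic counter is what detects the last block:

```cuda
#include <climits>

// Each block writes a block-level partial result; an atomic counter
// identifies the last block to finish, and that block alone sweeps the
// partials to produce the final answer in one kernel launch.
__device__ unsigned int retirementCount = 0;

__global__ void minReduceSingleKernel(const int *in, int *partial, int *out, int n)
{
    __shared__ int s[128];
    __shared__ bool amLast;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (idx < n) ? in[idx] : INT_MAX;
    __syncthreads();
    for (int i = blockDim.x / 2; i > 0; i >>= 1) {       // in-block sweep
        if (threadIdx.x < i)
            s[threadIdx.x] = min(s[threadIdx.x], s[threadIdx.x + i]);
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = s[0];                      // block-level result
        __threadfence();                                 // make it visible device-wide
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        amLast = (ticket == gridDim.x - 1);              // true only in the last block
    }
    __syncthreads();
    if (amLast) {                                        // final sweep over partials
        int v = INT_MAX;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            v = min(v, partial[i]);
        s[threadIdx.x] = v;
        __syncthreads();
        for (int i = blockDim.x / 2; i > 0; i >>= 1) {
            if (threadIdx.x < i)
                s[threadIdx.x] = min(s[threadIdx.x], s[threadIdx.x + i]);
            __syncthreads();
        }
        if (threadIdx.x == 0) { *out = s[0]; retirementCount = 0; }
    }
}
```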

Yes, it complicates the code. In my experience it’s for people who are going after the last 5-10% of available performance. You can read the Parallel Forall blog article on it. You don’t use it because of a shortage of shared memory; you use it when you want that last 5%. Also, on Kepler and beyond, global atomics are pretty fast (there is a Parallel Forall blog article on that too). However, since you are doing a segmented reduction, if you look at the methodology I discussed/linked at the beginning of my comments, you can dispense with atomics altogether if you really want to.

The global load efficiency of 40% does sound low to me. I would inspect the access patterns. 50% should be achievable, and probably higher depending on your “chunk” sizes, up to maybe 80% for large chunks. I can’t speak to whatever you’re seeing w.r.t. shared memory without a lot more data, but I would start by implementing (or using) a canonical algorithm, and as a benchmark probably compare whatever I wrote to thrust::reduce_by_key.

Generally speaking, yes, the compiler should do that. It’s easy enough to verify by dumping the SASS (e.g. with cuobjdump -sass) and grepping for ATOM instructions (which return the old value) versus RED instructions (reductions, which do not).

Thanks Bob,
lots to think about!

The current approach is to remove all the atomic operations by writing each
output only once. This is done by replacing the original one-thread-per-input
mapping with one block per output. I.e. each of the 3000-odd outputs is
processed by a block of threads which do a reduction in shared memory, save the
intermediate answer in a register, then move on to the next 64/128 inputs, again
do a reduction in shared memory, minimise with the value from the previous
64/128, and save the new min back into the register. Once the end of the chunk
has been reached, the final minimum is written to global memory.
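For concreteness, a sketch of the scheme just described (the markers[] array of chunk start offsets, with one final end offset, is my assumption about how the chunk boundaries are represented):

```cuda
#include <climits>

// One block per output: tiles of blockDim.x inputs are reduced in shared
// memory, with a running minimum kept in a register between tiles, and a
// single non-atomic write of the final result.
__global__ void chunkMinPerBlock(const int *in, int *out, const size_t *markers)
{
    __shared__ int s[128];
    size_t begin = markers[blockIdx.x], end = markers[blockIdx.x + 1];
    int best = INT_MAX;                                  // running min, in a register
    for (size_t base = begin; base < end; base += blockDim.x) {
        size_t idx = base + threadIdx.x;
        s[threadIdx.x] = (idx < end) ? in[idx] : INT_MAX;
        __syncthreads();
        for (int i = blockDim.x / 2; i > 0; i >>= 1) {   // sweep per tile
            if (threadIdx.x < i)
                s[threadIdx.x] = min(s[threadIdx.x], s[threadIdx.x + i]);
            __syncthreads();
        }
        best = min(best, s[0]);
        __syncthreads();                                 // before s is overwritten
    }
    if (threadIdx.x == 0) out[blockIdx.x] = best;        // single write, no atomicMin
}
```

Since each output is written exactly once, out[] needs no INT_MAX pre-fill and no atomics.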

I think my kernel is still limited by reading from global memory, GeForce GTX 745,
but nvprof suggests it is now about 36% faster.

  1. a handful of blocks will not use all their threads
  2. global memory writes cannot be coalesced


It should be possible to read and minimize without doing any sort of parallel reduction sweep, until all the reading of input data is done by the block. Take a look at the code I indicated earlier. It appears that you are doing a full shared memory sweep each time you load a set of 64/128 inputs. This should not be necessary.

Here’s a modified version of my code from the other thread that provides a variable-size group reduction method:

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/sequence.h>
#include <iostream>

const size_t ngroups = 5000;
const size_t groupsz = 3721;
const int blocksz = 1024;   // set equal to min(1024, groupsz)
const int halfblock = 512;  // choose as the minimum power of 2 greater than or equal to half of the blocksz

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__host__ __device__
T my_reduction_op(T &d1, T &d2){

  return d1+d2;
}

template <typename T>
__host__ __device__
T my_reduction_init(){
  return 0;
}

// for this kernel we assume that the above reduction functions are properly specified
// and that blocksz and halfblock are set correctly per the requirements above
// this variant allows for variable-sized groups.  markers[] contains the starting offset of each group.
template <typename T>
__global__ void vblockReduction(T * __restrict__ in, T * __restrict__ out, const size_t * __restrict__ markers){

  __shared__ T sdata[blocksz];
  size_t idx = threadIdx.x+markers[blockIdx.x];
  sdata[threadIdx.x] = my_reduction_init<T>();
  while (idx < markers[blockIdx.x+1]){  // block-striding
    sdata[threadIdx.x] = my_reduction_op(sdata[threadIdx.x], in[idx]);
    idx += blockDim.x;}
  __syncthreads();
  for (int i = halfblock; i>0; i>>=1){
    if ((threadIdx.x < i) && (threadIdx.x+i < blocksz))
      sdata[threadIdx.x] = my_reduction_op(sdata[threadIdx.x], sdata[threadIdx.x+i]);
    __syncthreads();}
  if (!threadIdx.x) out[blockIdx.x] = sdata[0];
}

using namespace thrust::placeholders;
int main(){

  thrust::device_vector<double> d(ngroups*groupsz, 1);
  thrust::device_vector<double> r(ngroups);
  unsigned long long dt = dtime_usec(0);
  thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<size_t>(0), _1/groupsz), thrust::make_transform_iterator(thrust::counting_iterator<size_t>(ngroups*groupsz), _1/groupsz), d.begin(), thrust::make_discard_iterator(), r.begin());
  dt = dtime_usec(dt);
  thrust::copy_n(r.begin(), 5, std::ostream_iterator<double>(std::cout, ","));
  std::cout << std::endl << "thrust time: " << dt/(float)USECPSEC << " seconds" << std::endl;

  thrust::device_vector<double> r2(ngroups);
  thrust::device_vector<size_t> m(ngroups+1);
  thrust::sequence(m.begin(), m.end(), (size_t)0, groupsz);
  dt = dtime_usec(0);
  vblockReduction<<<ngroups, blocksz>>>(thrust::raw_pointer_cast(d.data()), thrust::raw_pointer_cast(r2.data()), thrust::raw_pointer_cast(m.data()));
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  thrust::copy_n(r2.begin(), 5, std::ostream_iterator<double>(std::cout, ","));
  std::cout << std::endl << "kernel time: " << dt/(float)USECPSEC << " seconds" << std::endl;
  return 0;
}


Although it is currently written as a sum-reduction, the generalizing functions (my_reduction_init, my_reduction_op) easily allow it to be converted to a min-reduction.

Dear Bob,
Thank you for bearing with me. The earlier version of the code
I looked at was dealing with a groupsz less than the maximum number of threads (441)
and so was simpler than the one you kindly posted.

I am not good with templates (they did not work when originally introduced
into GCC, and I lost faith and never went back to them).
OK, being a bit dim: what you do is maintain a running sum in the shared memory
array. Something like
for (int thread = threadIdx.x; thread < top; thread += blockDim.x)
  sum[threadIdx.x] += a[thread];
(NB no __syncthreads needed inside the loop, since each thread only updates its
own element) and then do the reduction, just once, at the end.

I will try this:-)

Many thanks

I would urge you to reconsider with respect to CUDA. While I am not a big fan of the massive complexity of C++ (at all), templates are arguably the single most useful C++ feature for CUDA programmers, and their use can lead to substantial performance increases as configuration variables turn into template arguments. The CUDA toolchain has had properly working template support since about 2009, so everything related to templates should be rock solid at this stage.

Dear Bob,
I have just tried moving the reduction out of my for loop,
as you suggested, and the kernel is approximately twice as fast :-)

Many thanks