reduction for an odd number of elements coalesced memory access

Reduction example is designed for multiples of 2 elements.
I have my own implementation for any number of elements, but memory accesses are not coalesced.
Have anyone think about this problem and has a solution for this that works with any number of elements?

Thrust's reduction implementation coalesces and works with arbitrary input size.

If you’re interested, you can view our current implementation here.

Thrust’s reduction code seems also to work only for powers of two. In fact in the reduce function it is stated :
// BLOCK_SIZE must be a power of 2

Anyway, thanks

You’re correct – in our implementation BLOCK_SIZE must be a power of two for performance reasons. But N, the total number of elements reduced, is allowed to vary arbitrarily. If BLOCK_SIZE does not evenly divide N, then the few remaining threads at the end of the last block simply idle.

Thanks, I going to try it.

Although Thrust’s reduction may help me improve some of my code in some cases I have to sum several group elements (each) of size n where n is not a power of two (say 50).

So here’s the question again. Does anyone knows a way of doing this in a coalesced way?

Is there a way of finding out the index that was chosen as the result of thrusts reduction? Ie if element 5 was chosen because it contained the smallest value, can 5 be returned?

You transform a reduction for arbitrary N into a classic reduction for a power-of-two by just adding a single IF statement to the first reduction step.

So here’s an example which is hardwired for 256:

/* reduction for N=256, assuming we have 128 or more threads for simplicity */

a[threadIdx.x]=a[threadIdx.x+128];

__syncthreads();

if (threadIdx.x<64) a[threadIdx.x]=a[threadIdx.x+64];

__syncthreads();

if (threadIdx.x<32) a[threadIdx.x]=a[threadIdx.x+32];

__syncthreads();

/* et-cetera */

We can transform this to use some maximum item count N, where 128 <= N <=256 by:

if (threadIdx.x+128<N) a[threadIdx.x]=a[threadIdx.x+128]; // note additional test for max N range

__syncthreads();

if (threadIdx.x<64) a[threadIdx.x]=a[threadIdx.x+64];

__syncthreads();

if (threadIdx.x<32) a[threadIdx.x]=a[threadIdx.x+32];

__syncthreads();

/* et cetera */

For truly arbitrary N, you can test which level to start at, or fold a test for N into each if statement.

If you want the index of the minimum (maximum) element of a range, use min_element (max_element) [1]:

#include <thrust/device_vector.h>

#include <thrust/extrema.h>

#include <iostream>

int main(void)

{

  thrust::device_vector<int> vec;

// init vector

  ...

unsigned int index_of_min = thrust::min_element(vec.begin(), vec.end()) - vec.begin();

std::cout << "min element is " << vec[index_of_min] << " at index " << index_of_min << std::endl;

}

[1] http://thrust.googlecode.com/svn/tags/1.1…24e141f85e79999