Fastest single-warp min() algorithm for 32 values

I am trying to write the fastest single-warp min() algorithm for processing 32 values, where each of the numbers to be compared comes from a different thread in the same warp. Thus, we start with 32 values:

__global__ void kernelfunc()
{
    int num = ...;

    // insert new code here to get the minimum of all the "num" values
    // into some new variable in the 1st thread.

    ....
}

What would be the fastest code for this in CUDA, considering performance implications of reads-after-writes, writes-after-reads, bank conflicts, etc.?

I am considering a simple reduction, but I’m worried about the read-after-write/write-after-read latencies. I want to minimize these as, due to data size, I may only be able to fit a single warp executing concurrently on a multiprocessor.

Any suggestions?

Thanks,
Raystonn

Example 2 in this talk goes through the process of optimizing a parallel reduction for CUDA. It is intended for a large-scale reduction in global memory, but the example should be adaptable to your case:

http://www.gpgpu.org/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

It goes through all of the tricks you could employ, including loop unrolling and how to avoid shared memory bank conflicts.

I recently wrote the following code for a half-warp max/min. It's based on Mark Harris's example from the talk. Note that it takes no care of synchronization, because by definition a warp executes simultaneously; so the code won't work in emulation unless you use EMUSYNC. I don't think there are any bank conflicts, but you're welcome to spot any. I also welcome you to suggest improvements.
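
(The code itself is not reproduced in this copy of the thread; the sketch below is only a reconstruction of what a half-warp shared-memory min in that warp-synchronous style might look like. The function name, the use of fminf(), and the volatile pointer are assumptions, not the original code.)

// Illustrative sketch only (not the original post's code): a warp-synchronous
// min over 16 floats in shared memory, performed by the first 16 threads.
// No __syncthreads() is used because a half-warp executes in lockstep; in
// emulation mode an EMUSYNC between the steps would be required.
__device__ void halfwarp_min(volatile float *sdata)
{
    int tid = threadIdx.x;                                   // assumes tid in [0, 15]
    if (tid < 8) sdata[tid] = fminf(sdata[tid], sdata[tid + 8]);
    if (tid < 4) sdata[tid] = fminf(sdata[tid], sdata[tid + 4]);
    if (tid < 2) sdata[tid] = fminf(sdata[tid], sdata[tid + 2]);
    if (tid < 1) sdata[tid] = fminf(sdata[tid], sdata[tid + 1]);
    // minimum ends up in sdata[0]
}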

Hope it helped. Please let us know how you finally wrote the routine.

Anjul

I use the following, which works pretty well. The obvious race condition makes it appear nonsensical at first glance, but closer inspection reveals that although the behavior of the logic path is undefined, the final result is deterministic and correct.

double my_value;
__shared__ double min_value;

min_value = 1.7976931348623158e+308;   // DBL_MAX sentinel

my_value = compute_my_value(...);

while (my_value < min_value)
{
    min_value = my_value;
}

Worst case is n, but since n is only 32, no big deal. I would have guessed the average case is log2(n) (i.e. 5) because each pass is expected to remove half of the candidates on average. Interestingly, my empirical results show an average of 4.0 iterations. This is perhaps because the two half-warps perform their writes in series with respect to each other.

With such small n, however, the tightness of the loop is at least as important as the asymptotic order, so this will crush a more complex algorithm with a guaranteed O(log n) bound.

When I find a few free minutes to spare, I think maybe I will extend this to an arbitrarily sized min() reduction (no promises). The point being that using the conventional reduction algorithm is very inefficient for this kind of operation.

The moral of the story: learn what you can about theory and established algorithms, but don't blindly follow them. Sometimes the practical details of the problem at hand trump theory.

Ken Seehart (ken@seehart.com)

// Reducing 64 values located in a 64-element shared memory vector 'smem'
__device__ void warp_reduce_min(volatile float smem[64])
{
    smem[threadIdx.x] = smem[threadIdx.x+32] < smem[threadIdx.x] ? smem[threadIdx.x+32] : smem[threadIdx.x];
    smem[threadIdx.x] = smem[threadIdx.x+16] < smem[threadIdx.x] ? smem[threadIdx.x+16] : smem[threadIdx.x];
    smem[threadIdx.x] = smem[threadIdx.x+8]  < smem[threadIdx.x] ? smem[threadIdx.x+8]  : smem[threadIdx.x];
    smem[threadIdx.x] = smem[threadIdx.x+4]  < smem[threadIdx.x] ? smem[threadIdx.x+4]  : smem[threadIdx.x];
    smem[threadIdx.x] = smem[threadIdx.x+2]  < smem[threadIdx.x] ? smem[threadIdx.x+2]  : smem[threadIdx.x];
    smem[threadIdx.x] = smem[threadIdx.x+1]  < smem[threadIdx.x] ? smem[threadIdx.x+1]  : smem[threadIdx.x];
}

EDIT: The min value of the array will be located in smem[0].
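
A minimal usage sketch, just for illustration (the kernel name, the FLT_MAX padding of the upper 32 slots, and the warp-synchronous assumption are additions, not part of the post above):

#include <float.h>

// Hypothetical caller: one warp of 32 threads reduces its 32 per-thread values.
// The upper 32 slots are padded with FLT_MAX so the first step (+32) is harmless.
// Relies on warp-synchronous execution, as in the routine above.
__global__ void min_of_warp(const float *in, float *out)
{
    __shared__ volatile float smem[64];

    float num = in[threadIdx.x];            // one value per thread
    smem[threadIdx.x]      = num;
    smem[threadIdx.x + 32] = FLT_MAX;       // neutral element for min

    warp_reduce_min(smem);                  // no __syncthreads() needed within one warp

    if (threadIdx.x == 0)
        out[0] = smem[0];                   // minimum of the 32 values
}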

Have you tried using atomicMin()? It might turn out to be just as fast as this.
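
For integer data that could look something like the sketch below (kernel and variable names are illustrative; shared-memory atomics require compute capability 1.2 or higher, and atomicMin() has no double overload):

#include <limits.h>

// Sketch for the integer case only: every thread folds its value into a
// shared minimum with atomicMin(). Requires shared-memory atomics.
__global__ void warp_min_atomic(const int *in, int *out)
{
    __shared__ int min_value;

    int num = in[threadIdx.x];
    if (threadIdx.x == 0) min_value = INT_MAX;
    __syncthreads();

    atomicMin(&min_value, num);
    __syncthreads();

    if (threadIdx.x == 0) out[0] = min_value;
}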

But wait, that’s not all!

double my_value;
__shared__ double min_value;

my_value = compute_my_value(...);

min_value = my_value;

while (my_value < min_value)
{
    min_value = my_value;
}

Immediately assigning my_value to min_value saves one iteration. atomicMin() works with integers only, so it wouldn't work in my case. Not sure if it would be faster for the integer case… I'll try a speed comparison…

For scanning very small shared memory arrays for the minimum/maximum I have found that the classical reduction tree may not be the best way because of the larger number of shared memory accesses. What seemed to work better (at least in the context of my code) was to have threads 0 and 1 in the warp scan the odd and even elements respectively, keeping the current minimum/maximum in a register. Then, after the per-thread loop expired, I had the two threads write out their results to shared memory. After the next __syncthreads(), I had all threads pick the final minimum/maximum from among the two values stored. Variations of this technique are obviously possible, for example using three instead of two threads.

Which technique works best will depend on the size of the array, the target architecture, and the code context, so some experimentation may be in order to find the best solution for a given situation. The general thing to keep in mind is that the classical binary tree reduction may not always be the highest-performance option for special-case scenarios.
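
A minimal sketch of that two-thread idea for a 32-element case (kernel name, array names, and the use of fminf() are illustrative assumptions):

// Threads 0 and 1 each walk half of the 32 shared values (even/odd indices),
// keeping the running minimum in a register; the two partial results are then
// combined by every thread after a __syncthreads().
__global__ void min_two_thread_scan(const float *in, float *out)
{
    __shared__ float sdata[32];
    __shared__ float partial[2];

    sdata[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    if (threadIdx.x < 2)
    {
        float m = sdata[threadIdx.x];
        for (int i = threadIdx.x + 2; i < 32; i += 2)   // stride-2 serial scan
            m = fminf(m, sdata[i]);
        partial[threadIdx.x] = m;
    }
    __syncthreads();

    float block_min = fminf(partial[0], partial[1]);    // all threads pick up the result

    if (threadIdx.x == 0)
        out[0] = block_min;
}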

If the minimum operation is done on 32-bit data, then one can try LDS.128 on Fermi (sm_20, sm_21).

The following code demonstrates the idea:

#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>

#define WARP_SIZE 32

__global__ void fast_min_per_warp( float *A, float *minA )
{
    volatile __shared__ float As_f[4*WARP_SIZE];
    int tid = threadIdx.x;
    As_f[tid] = A[tid];
    __syncthreads();

    float4 *As_f4 = (float4 *)As_f;

    // reduction by LDS.128 on Fermi
    float4 a_reg;
    float b, c;

    a_reg = As_f4[tid];
    b = min(a_reg.x, a_reg.y);
    c = min(a_reg.z, a_reg.w);
    As_f[tid] = min(b, c);

    a_reg = As_f4[tid];
    b = min(a_reg.x, a_reg.y);
    c = min(a_reg.z, a_reg.w);
    As_f[tid] = min(b, c);

    // final two elements
    As_f[tid] = min(As_f[tid], As_f[tid+1]);

    __syncthreads();

    if (0 == tid){
        minA[0] = As_f[0];
    }
}

int main(int argc, char *argv[])
{
    float *h_A;
    float *d_A;
    float *h_minA;
    float *d_minA;
    cudaError_t status;

    h_A = (float*) malloc(sizeof(float)*WARP_SIZE);
    assert( NULL != h_A );
    h_minA = (float*) malloc(sizeof(float));
    assert( NULL != h_minA );

    status = cudaMalloc((void**)&d_A, sizeof(float)*WARP_SIZE);
    assert(cudaSuccess == status);
    status = cudaMalloc((void**)&d_minA, sizeof(float));
    assert(cudaSuccess == status);

    for(int i = 0 ; i < WARP_SIZE ; i++){
        h_A[i] = (float)(WARP_SIZE+1-i);
        printf("h_A[%d] = %f\n", i, h_A[i]);
    }

    status = cudaMemcpy(d_A, h_A, sizeof(float)*WARP_SIZE, cudaMemcpyHostToDevice);
    assert(cudaSuccess == status);

    fast_min_per_warp<<< 1, WARP_SIZE >>>(d_A, d_minA);

    status = cudaMemcpy(h_minA, d_minA, sizeof(float), cudaMemcpyDeviceToHost);
    assert(cudaSuccess == status);

    printf("min(A) = %f\n", *h_minA);

    return 0;
}

If you check the generated code with cuobjdump, only 15 instructions plus 2 __syncthreads() are needed to take the minimum of 32 32-bit elements.

Revised edition:

__global__ void cu_find_minima(double *A, int *minima)
{
    // find the minimum for each block
    const int i = threadIdx.x;
    const int j = blockIdx.x;
    const int t = j*blockDim.x + i;

    double my_value;
    __shared__ double min_value;

    // Read value from input array - note that in many scenarios, this would be replaced
    // by a computation of some kind or a serial min of several computations
    my_value = A[t]; // A[j][i]

    min_value = my_value;
    __syncthreads(); // can be omitted in a single-warp scenario

    while (my_value < min_value)
    {
        // race condition: multiple writes to min_value, but we don't care!
        min_value = my_value;
        __syncthreads(); // can be omitted in a single-warp scenario, but for multiple warps this prevents a race to the conditional
    }

    if (my_value == min_value)
    {
        minima[j] = i; // save the index of the minimum for each block
    }
}

Adding __syncthreads() generalizes my previous post to work with an arbitrary number of warps instead of a single warp.

If __syncthreads() is omitted, this only works when there is only one warp (which was the originally stated problem). This is because a race condition can cause min_value to be replaced by a larger value after the conditional is executed. As previously noted, the race condition for the assignment itself can be safely ignored.

I believe this will run faster than the other suggestions, particularly in the case where the values to be scanned are computed per thread in the kernel. It performs fewer than log2(n) parallel passes (assuming a sufficient number of cores), and each pass only performs one compare and one conditional write. After the first pass, most of the half-warps will unanimously skip the write operation. Normally we don't consider a conditional to be beneficial, but when all threads in a half-warp agree, the conditional actually provides a significant saving. Also, there is no excess overhead associated with a temporary storage buffer, which is important if the minimum calculation is being performed on values calculated per-thread by the kernel, as opposed to the operation being performed on input data.

LSChien's solution seems to come closest, with 15 instructions + 2 __syncthreads(). The above code encounters an average of 4 __syncthreads(); however, in the single-warp case I omit the __syncthreads(), and the average case is 8 operations (4 compares and 4 writes). Admittedly, the worst case is 32 compares and 32 writes, but I'm more concerned with the average case. If I encountered a pathological scenario where the data tends to be sorted in an unfavorable manner, I would simply reverse the thread order and expect an even lower average case!

As njuffa pointed out, classical reduction is not optimal for this situation. Min is not the same as Add, so you should not have to work as hard to do it.

Another variation, per njuffa's comments: each thread could serially compute (or read from an input buffer) several values, keeping track of the minimum, then proceed to the above code, as sketched below.
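
A sketch of that variation under the same single-warp, warp-synchronous assumptions as above (N, A, and the kernel name are placeholders; the volatile qualifier is an addition to keep the compiler from caching the shared value in a register):

// Each of the 32 threads strides through N values, keeping a register minimum,
// and then falls into the same race-condition loop as above (single warp, so
// no __syncthreads()).
__global__ void cu_find_minimum_strided(const double *A, int N, double *result)
{
    volatile __shared__ double min_value;

    double my_value = 1.7976931348623158e+308;     // DBL_MAX
    for (int k = threadIdx.x; k < N; k += 32)      // 1 of every 32 values per thread
        my_value = fmin(my_value, A[k]);

    min_value = my_value;                          // an arbitrary thread's write wins

    while (my_value < min_value)
    {
        min_value = my_value;                      // race condition, but benign
    }

    if (threadIdx.x == 0)
        result[0] = min_value;
}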

Yet another variation: if all but one multiprocessor can be kept busy on other streams, we can dedicate one warp to the min operation, with each thread computing 1 of every 32 values in series, and then apply the above solution with __syncthreads() omitted (because __syncthreads() is not necessary in the single-warp scenario).

There seems to be some weird magic going on here, since a basic understanding of reductions tells us not to expect better than log2(n) = 5 passes. Note that I am counting the initial write and the final compare in addition to 3 passes through the inner loop, for a total of 4 compares and 4 writes.

So how does this algorithm beat log2(n)? The answer is that we are gambling, but the odds are stacked in our favor. In each pass we are just as likely to have good luck as bad luck, but good luck yields more benefit than bad luck costs. I haven't worked out the mathematics (exercise left for the reader), but I get the following empirical result:

k = 2/3 * (1 + log2(n))

where n is the number of threads (with one value per thread) and k is the average number of parallel passes required.
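
One way to sanity-check that figure is a small host-side simulation. The sketch below models the arbitrary winner of the colliding shared-memory writes as a uniformly random pick among the threads still below the visible minimum; that model is an assumption, not something measured on hardware. Under it, the expected number of updates works out to the n-th harmonic number, roughly 4.06 for n = 32, which is in the same ballpark as the 4.0 iterations measured above and the fitted formula.

// Host-side Monte Carlo sketch: count how many times the visible minimum is
// updated before it settles, starting from a sentinel above all values.
// The "random winner" model of colliding warp writes is an assumption.
#include <cstdio>
#include <vector>
#include <random>

int main()
{
    const int n = 32, trials = 100000;
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    double total_passes = 0.0;
    for (int t = 0; t < trials; ++t)
    {
        std::vector<double> v(n);
        for (double &x : v) x = uni(rng);

        double min_value = 1.0;                  // sentinel above every value
        int passes = 0;
        for (;;)
        {
            std::vector<int> candidates;         // threads that would still write
            for (int i = 0; i < n; ++i)
                if (v[i] < min_value) candidates.push_back(i);
            if (candidates.empty()) break;
            min_value = v[candidates[rng() % candidates.size()]];
            ++passes;
        }
        total_passes += passes;
    }
    printf("average updates for n = %d: %.2f\n", n, total_passes / trials);
    return 0;
}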

Enjoy,

Ken