Ballot Based Reduction

So… I got my new GTX480 and I decided to play around with __ballot() and see if I can speed up some of my algorithms.

Most obvious is using it to speed up my reduction. I use it to sum up binary values (true/false), so it seems ideal.

Benchmarking seems positive, makes a noticeable difference in my application. Sorry, don’t have hard numbers for you, but considering that reduction is not the most expensive part of my application, it points towards this being a significant performance enhancement.

As a note, Warp and thread sizes are #define’d in my application’s headers, so you may need to edit this to suit your uses. The base that I worked off is part of the CUDA SDK Reduction example.

Please comment if you can think of a way to better this.

template<class T>

__device__ void reduction_add(T* result, T add)

{

    unsigned int tid = threadIdx.x;

    unsigned int wid = threadIdx.x>>5; // Warp ID

    __shared__ T sMem[CUDAWARPS];

	__syncthreads();

int res = __popc(__ballot(add)); // Magic happens here

    if((tid&31)==0) { sMem[wid] = add = res;}

__syncthreads();

	if(tid<CUDAWARPS)

	{

		volatile T *smem = sMem;

		add = smem[tid];

		if(CUDAWARPS>=16) { smem[tid] = add = add + smem[tid+8]; } // 512 threads

		if(CUDAWARPS>=8) { smem[tid] = add = add + smem[tid+4]; } // 256 threads

		if(CUDAWARPS>=4) { smem[tid] = add = add + smem[tid+2]; } // 128 threads

		if(CUDAWARPS>=2) { smem[tid] = add = add + smem[tid+1]; } // 64 threads

	}

    if(tid==0) result[0] = sMem[0];

	__syncthreads();

}

I considered using ballot reductions for non-binary values as well, such as ballot’ing a bit at a time. So 8 ballot calls to gather a byte. I might experiment when I need this functionality, but for now this is fine. At what point ‘classical’ gather would once again be faster than ballot gather I cannot guess at this point.