So… I got my new GTX480 and I decided to play around with __ballot() and see if I can speed up some of my algorithms.
Most obvious is using it to speed up my reduction. I use it to sum up binary values (true/false), so it seems ideal.
Benchmarking seems positive, makes a noticeable difference in my application. Sorry, don’t have hard numbers for you, but considering that reduction is not the most expensive part of my application, it points towards this being a significant performance enhancement.
As a note, Warp and thread sizes are #define’d in my application’s headers, so you may need to edit this to suit your uses. The base that I worked off is part of the CUDA SDK Reduction example.
Please comment if you can think of a way to better this.
template<class T>
__device__ void reduction_add(T* result, T add)
{
unsigned int tid = threadIdx.x;
unsigned int wid = threadIdx.x>>5; // Warp ID
__shared__ T sMem[CUDAWARPS];
__syncthreads();
int res = __popc(__ballot(add)); // Magic happens here
if((tid&31)==0) { sMem[wid] = add = res;}
__syncthreads();
if(tid<CUDAWARPS)
{
volatile T *smem = sMem;
add = smem[tid];
if(CUDAWARPS>=16) { smem[tid] = add = add + smem[tid+8]; } // 512 threads
if(CUDAWARPS>=8) { smem[tid] = add = add + smem[tid+4]; } // 256 threads
if(CUDAWARPS>=4) { smem[tid] = add = add + smem[tid+2]; } // 128 threads
if(CUDAWARPS>=2) { smem[tid] = add = add + smem[tid+1]; } // 64 threads
}
if(tid==0) result[0] = sMem[0];
__syncthreads();
}
I considered using ballot reductions for non-binary values as well, such as ballot’ing a bit at a time. So 8 ballot calls to gather a byte. I might experiment when I need this functionality, but for now this is fine. At what point ‘classical’ gather would once again be faster than ballot gather I cannot guess at this point.