I’m trying to find the fastest way to solve a max value problem.

The function itself is insanely simple… something like this (plus a bit of robustness to handle counts that are not a power of 2):

__global__ void max( float * src, float * dst )
{
    // one thread per output element: each thread compares one adjacent pair
    unsigned int pos = blockIdx.x*blockDim.x + threadIdx.x;
    dst[ pos ] = ( src[ pos*2 ] > src[ pos*2 + 1 ] ) ? src[ pos*2 ] : src[ pos*2 + 1 ];
}

Once out of the function, just swap the src and dst pointers, recalculate the blocks and threads, and call the function again. This would of course loop, dividing the total thread and block count by 2 on each pass.
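A minimal sketch of that swap-and-relaunch loop, under a few assumptions: the names max_pairs and find_max are hypothetical, the block size of 256 is an arbitrary choice, the kernel takes an extra count parameter so odd counts work, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>

// Pairwise max over n inputs; writes (n + 1) / 2 outputs.
// The bounds checks are the "bit of robustness" for counts that are not a power of 2.
__global__ void max_pairs(const float* src, float* dst, unsigned int n)
{
    unsigned int pos = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int out = (n + 1) / 2;        // number of outputs this pass
    if (pos >= out) return;                // surplus threads in the last block do nothing
    unsigned int i = pos * 2;
    float a = src[i];
    float b = (i + 1 < n) ? src[i + 1] : a; // an odd trailing element has no partner
    dst[pos] = (a > b) ? a : b;
}

// Hypothetical host helper: returns the maximum of n >= 1 floats.
float find_max(const float* host, unsigned int n)
{
    const unsigned int threads = 256;      // assumed block size, not tuned
    float* src = nullptr;
    float* dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, ((n + 1) / 2) * sizeof(float));
    cudaMemcpy(src, host, n * sizeof(float), cudaMemcpyHostToDevice);

    while (n > 1) {
        unsigned int out = (n + 1) / 2;
        unsigned int blocks = (out + threads - 1) / threads; // round up
        max_pairs<<<blocks, threads>>>(src, dst, n);
        std::swap(src, dst);               // this pass's output is the next pass's input
        n = out;
    }

    float result = 0.0f;
    cudaMemcpy(&result, src, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(src);
    cudaFree(dst);
    return result;
}
```

Each pass halves the element count (rounding up), so 65536 values take 16 kernel launches before a single value is left.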

So, for instance, say I have 65536 values to compare.

How do I figure out the trade-offs between just running 32768 blocks with 1 thread each vs. running 64 blocks with 512 threads each (32768 threads either way, one per pair)?
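One way to compare launch configurations empirically is to time a single pass with CUDA events. The sketch below is only an illustration under assumptions: max_pairs and time_first_pass are hypothetical names, the input is zero-filled because only the timing matters here, n is assumed even, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Pairwise max; `out` is the number of output elements (half the input count).
__global__ void max_pairs(const float* src, float* dst, unsigned int out)
{
    unsigned int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos < out) {
        float a = src[pos * 2];
        float b = src[pos * 2 + 1];
        dst[pos] = (a > b) ? a : b;
    }
}

// Hypothetical helper: time one pass over n values (n even) using the given
// number of threads per block. Returns elapsed milliseconds.
float time_first_pass(unsigned int n, unsigned int threads)
{
    unsigned int out = n / 2;
    unsigned int blocks = (out + threads - 1) / threads; // round up
    float* src = nullptr;
    float* dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, out * sizeof(float));
    cudaMemset(src, 0, n * sizeof(float));   // contents do not affect the timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    max_pairs<<<blocks, threads>>>(src, dst, out); // warm-up launch, not timed

    cudaEventRecord(start);
    max_pairs<<<blocks, threads>>>(src, dst, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the timed kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return ms;
}
```

Calling time_first_pass(65536, 1) launches 32768 blocks of 1 thread, while time_first_pass(65536, 512) launches blocks of 512 threads; averaging several runs of each would give a steadier comparison than a single measurement.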

Do the threads all run concurrently, or do they share processor time the way threads on a normal x86 chip would? I think I keep glazing over the part of the documentation that specifically points that little fact out…

Thanks for your time,

–Troy