Min Max problem in parallel

I’m trying to find the fastest way to solve a max value problem.

The function itself is insanely simple… something like this (add a bit of robustness to handle non-power-of-two counts):

__global__ void max( float * src, float * dst )
{
    unsigned int pos = blockIdx.x * blockDim.x + threadIdx.x;
    dst[ pos ] = ( src[ pos * 2 ] > src[ pos * 2 + 1 ] ) ? src[ pos * 2 ] : src[ pos * 2 + 1 ];
}

Once out of the function, just swap the src and dst pointers, recalculate the block and thread counts, and call the function again. This would of course loop, dividing the total thread and block count by 2 each pass.
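For concreteness, here is a minimal sketch of that host-side loop, assuming the max() kernel above, a power-of-two element count, and a made-up THREADS_PER_BLOCK tuning constant:

void reduce_max( float * d_src, float * d_dst, unsigned int n )
{
    const unsigned int THREADS_PER_BLOCK = 256;  // hypothetical tuning value

    while ( n > 1 )
    {
        unsigned int threads = n / 2;  // each thread compares two elements
        unsigned int blocks  = ( threads + THREADS_PER_BLOCK - 1 ) / THREADS_PER_BLOCK;
        unsigned int tpb     = ( threads < THREADS_PER_BLOCK ) ? threads : THREADS_PER_BLOCK;

        max<<< blocks, tpb >>>( d_src, d_dst );

        float * tmp = d_src;  // swap src and dst for the next pass
        d_src = d_dst;
        d_dst = tmp;

        n = threads;  // the array halves each pass
    }
    // the final maximum is now in d_src[ 0 ]
}

Each pass halves the element count, so the loop runs log2(n) times and the result ends up in element 0 of whichever buffer was written last.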

So, for instance, say I have 65536 values to reduce.
How do I figure out the trade-offs between running 32768 blocks with 1 thread each vs. 64 blocks with 512 threads each?

Do the threads all run concurrently, or do they share processor time just like a normal x86 chip would? I think I keep glazing over the part that specifically points that little fact out…

Thanks for your time,
–Troy

Take a look at the ‘reduction’ sample project in the CUDA SDK. It currently does sum reduction (IIRC), but I bet you could tweak it pretty easily to do a parallelized max() for you.
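For illustration, here is a rough sketch of that tweak: the shared-memory sum from the sample replaced by a max. The kernel name is mine, and it assumes blockDim.x is a power of two and that the element count is exactly gridDim.x * blockDim.x * 2:

__global__ void max_reduce( const float * src, float * dst )
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * ( blockDim.x * 2 ) + threadIdx.x;

    // each thread loads two elements from global memory (coalesced)
    // and keeps the larger one
    float a = src[ i ];
    float b = src[ i + blockDim.x ];
    sdata[ tid ] = ( a > b ) ? a : b;
    __syncthreads();

    // tree reduction in shared memory
    for ( unsigned int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if ( tid < s && sdata[ tid + s ] > sdata[ tid ] )
            sdata[ tid ] = sdata[ tid + s ];
        __syncthreads();
    }

    // thread 0 writes this block's candidate maximum
    if ( tid == 0 )
        dst[ blockIdx.x ] = sdata[ 0 ];
}

Launch it with blockDim.x * sizeof(float) bytes of dynamic shared memory; each block emits one candidate maximum, so you re-run it (or finish on the CPU) until one value remains.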

Fully agreed, the algorithm you provide is completely unsuitable: it does a massive amount of global memory traffic, and its reads are uncoalesced.

Also, you cannot really have only one thread per block: a warp (32 threads) is the minimum scheduling unit, and as I understand it, a lower thread count is basically “emulated” by disabling the other threads in the warp.

This in turn means that each step should reduce your array by at least a factor of 32. But as said, look at the reduction examples (there is also some detailed documentation about the many optimizations used somewhere); they have all the details.
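To make that concrete, here is one illustrative way to fold many elements per thread before the shared-memory tree, so a single pass shrinks the array by far more than a factor of 2; the name and launch assumptions are mine:

#include <float.h>

// Each thread strides across the array keeping a running max, then the
// block reduces its partial results in shared memory; one pass turns
// n elements into gridDim.x elements. Assumes blockDim.x is a power of two.
__global__ void max_reduce_strided( const float * src, float * dst, unsigned int n )
{
    extern __shared__ float sdata[];

    unsigned int tid    = threadIdx.x;
    unsigned int i      = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;

    float m = -FLT_MAX;
    for ( ; i < n; i += stride )
        m = ( src[ i ] > m ) ? src[ i ] : m;

    sdata[ tid ] = m;
    __syncthreads();

    for ( unsigned int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if ( tid < s && sdata[ tid + s ] > sdata[ tid ] )
            sdata[ tid ] = sdata[ tid + s ];
        __syncthreads();
    }

    if ( tid == 0 )
        dst[ blockIdx.x ] = sdata[ 0 ];
}

With, say, 64 blocks of 256 threads, 512000 elements collapse to 64 partial maxima in one launch, and a second small launch (or the CPU) finishes the job.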

And in my experience on an 8800 GTX, if running your code over 512000 elements takes more than 150 µs, you are doing something wrong (that is the number I get for a reduction over float2 elements with emulated double precision).