Min Max problem in parallel

I’m trying to find the fastest way to solve a max value problem.

The function itself is insanely simple… something like this (add a bit of robustness to handle non-power-of-two counts):

__global__ void max( float * src, float * dst )
{
    unsigned int pos = blockIdx.x * blockDim.x + threadIdx.x;
    dst[ pos ] = ( src[ pos * 2 ] > src[ pos * 2 + 1 ] ) ? src[ pos * 2 ] : src[ pos * 2 + 1 ];
}

Once out of the function, just swap the src and dst pointers, recalculate the block and thread counts, and call the function again. This would of course loop, dividing the total thread and block count by 2 each pass.
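For concreteness, here is a minimal sketch of that host-side loop, assuming the max() kernel above, a power-of-two element count, and a made-up THREADS_PER_BLOCK tuning constant:

void reduce_max( float * d_src, float * d_dst, unsigned int n )
{
    const unsigned int THREADS_PER_BLOCK = 256;  // hypothetical tuning value

    while ( n > 1 )
    {
        unsigned int threads = n / 2;  // each thread compares two elements
        unsigned int blocks  = ( threads + THREADS_PER_BLOCK - 1 ) / THREADS_PER_BLOCK;
        unsigned int tpb     = ( threads < THREADS_PER_BLOCK ) ? threads : THREADS_PER_BLOCK;

        max<<< blocks, tpb >>>( d_src, d_dst );

        float * tmp = d_src;  // swap src and dst for the next pass
        d_src = d_dst;
        d_dst = tmp;

        n = threads;  // the array halves each pass
    }
    // the final maximum is now in d_src[ 0 ]
}

Each pass halves the element count, so the loop runs log2(n) times and the result ends up in element 0 of whichever buffer was written last.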

So, for instance, say I have 65536 values to reduce.
How do I figure out the trade-offs between running 32768 blocks with 1 thread each vs. 64 blocks with 512 threads each?

Do the threads all run concurrently, or do they share processor time just like a normal x86 chip would? I think I keep glazing over the part that specifically points that little fact out…

Thanks for your time,
–Troy

Take a look at the ‘reduction’ sample project in the CUDA SDK. It currently does sum reduction (IIRC), but I bet you could tweak it pretty easily to do a parallelized max() for you.
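For illustration, here is a rough sketch of that tweak: the shared-memory sum from the sample replaced by a max. The kernel name is mine, and it assumes blockDim.x is a power of two and that the element count is exactly gridDim.x * blockDim.x * 2:

__global__ void max_reduce( const float * src, float * dst )
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * ( blockDim.x * 2 ) + threadIdx.x;

    // each thread loads two elements from global memory (coalesced)
    // and keeps the larger one
    float a = src[ i ];
    float b = src[ i + blockDim.x ];
    sdata[ tid ] = ( a > b ) ? a : b;
    __syncthreads();

    // tree reduction in shared memory
    for ( unsigned int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if ( tid < s && sdata[ tid + s ] > sdata[ tid ] )
            sdata[ tid ] = sdata[ tid + s ];
        __syncthreads();
    }

    // thread 0 writes this block's candidate maximum
    if ( tid == 0 )
        dst[ blockIdx.x ] = sdata[ 0 ];
}

Launch it with blockDim.x * sizeof(float) bytes of dynamic shared memory; each block emits one candidate maximum, so you re-run it (or finish on the CPU) until one value remains.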

Fully agreed, the algorithm you provide is completely unsuitable: it does a massive amount of global memory traffic, and its reads are uncoalesced.

Also, you cannot really have only one thread per block: a warp (32 threads) is the minimum scheduling unit, and as I understand it, a lower thread count is basically “emulated” by disabling the other threads in the warp.

This in turn means that each step should reduce your array by at least a factor of 32. But as said, look at the reduction examples (there is also some detailed documentation about the many optimizations used somewhere); they have all the details.
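To make that concrete, here is one illustrative way to fold many elements per thread before the shared-memory tree, so a single pass shrinks the array by far more than a factor of 2; the name and launch assumptions are mine:

#include <float.h>

// Each thread strides across the array keeping a running max, then the
// block reduces its partial results in shared memory; one pass turns
// n elements into gridDim.x elements. Assumes blockDim.x is a power of two.
__global__ void max_reduce_strided( const float * src, float * dst, unsigned int n )
{
    extern __shared__ float sdata[];

    unsigned int tid    = threadIdx.x;
    unsigned int i      = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;

    float m = -FLT_MAX;
    for ( ; i < n; i += stride )
        m = ( src[ i ] > m ) ? src[ i ] : m;

    sdata[ tid ] = m;
    __syncthreads();

    for ( unsigned int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if ( tid < s && sdata[ tid + s ] > sdata[ tid ] )
            sdata[ tid ] = sdata[ tid + s ];
        __syncthreads();
    }

    if ( tid == 0 )
        dst[ blockIdx.x ] = sdata[ 0 ];
}

With, say, 64 blocks of 256 threads, 512000 elements collapse to 64 partial maxima in one launch, and a second small launch (or the CPU) finishes the job.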

And in my experience on an 8800 GTX, if running your code over 512000 elements takes more than 150 µs, you are doing something wrong (that is the number I get for a reduction over float2 elements with emulated double precision).