Most efficient blockmin function?

Raystonn · April 5, 2009, 7:27am

I’m trying to write the most efficient function to find the minimum value of a variable across all threads of a block.

Does anyone see any room for improvement here? It would be nice if we had an intrinsic for this.

Thanks

unsigned int blockmin(unsigned int data)

{

	__shared__ unsigned int datum[BLOCK_SIZE];

	__shared__ int zeroflag;

	unsigned int i;

	//

	// Check for zero on quick path

	//

	zeroflag = 0;

	__syncthreads();

	if (data == 0)

	{

		zeroflag = 1;

	}

	__syncthreads();

	if (zeroflag)

	{

		return 0;

	}

	//

	// Zero not present on any thread.  Get min of all threads' values.

	//

	datum[threadIdx.x] = data;

	__syncthreads();

	for(unsigned int i=BLOCK_SIZE/2; i>0; i>>=1)

	{

		if (threadIdx.x < i)

		{

			if (datum[threadIdx.x + i] < datum[threadIdx.x])

			{

				datum[threadIdx.x] = datum[threadIdx.x + i];

			}

		}

		__syncthreads();

	}

	return datum[0];

}

Raystonn · April 5, 2009, 7:35am

I’m thinking that even if an intrinsic for blockmin is impossible due to thread syncing issues, one for warps rather than blocks could be very useful. In fact, that would reduce a blockmin implementation down to a single value to be checked per warp rather than one value to be checked per thread. That would make a very nice speedup.

Anyhow, does anyone see anything I’m missing in speeding this up?

Thanks

SPWorley · April 5, 2009, 8:04am

What you’re looking for is a common pattern in CUDA called a reduction. It can be done in parallel much more efficiently than a loop like your code above.

Check out the CUDA SDK’s reduction example… it’s extremely well documented.

Raystonn · April 5, 2009, 8:06am

I’m trying to figure out if there is some way to use the vote intrinsics _any or _all to do a warp min to speed this up.

Thanks for the reduction suggestion.

Raystonn · April 5, 2009, 8:41am

I have updated the code on the original post to use reduction. But I’m still looking for a better way to do this. Is there any way to do a warp min without using shared memory?

MisterAnderson42 · April 5, 2009, 1:44pm

No, threads cannot communicate except through shared memory.

Raystonn · April 5, 2009, 4:57pm

Not quite true. The vote intrinsics _any and _all can be used to gather information from all threads in a warp. I’m just trying to figure out if there is a way to use them for a warp-wide min function.

E.D_Riedijk · April 5, 2009, 6:30pm

Well, you can find out if any of the threads in the warp are smaller than the current minimum. But if you can use that to make a faster min than a standard reduction? I have the idea not. In general the more simple and stupid the code is, the faster it performs ;)

seibert · April 5, 2009, 7:17pm

Out of curiosity, does anyone know of a simple (but practical) algorithm that benefits from the vote intrinsics? I keep thinking I understand where to use them, but then end up with something that doesn’t actually need warp voting.

Raystonn · April 5, 2009, 9:56pm

Well, if we had vote min and max I think I’ve demonstrated the use case. :)

As far as vote any/all, they allow you to test if a certain condition would cause a branch divergence without actually causing a branch divergence. Any of the vote intrinsics basically allows fast interthread communication without the use of shared memory.

seibert · April 5, 2009, 10:44pm

Right, I’ve got that part already. I’m looking for a simple algorithm where being able to construct a predicate across a warp (but not a block) helps you. I have not come up with one thinking about the sorts of problems I usually solve. Seeing a real use for warp voting might help me understand where it would be beneficial.

Raystonn · April 5, 2009, 11:14pm

I don’t have any examples for you that would require vote any/all, just vote min/max. In fact, vote min/max can be used to implement vote any/all. The former are more fundamental than the latter.

Jamie_K · April 6, 2009, 3:31am

I had a hard time thinking of appilcations but I think this might be one:

Suppose you an operation, where each thread takes a sequence of inputs, and depending on some predicate, may or may not produce an output. Each thread appends its output, if any, to its own array, so threads don’t step on each other. Writes will generally be uncoalesced. Perhaps outputs could be queued in shared memory, and then when __any() of the threads fills its queue, they flush the queued elements to global memory. If the output is allowed to have gaps, then these could be coalesced writes.

Somewhat contrived, I know, but I think it could provide an advantage in this case.

As for min/max, it is possible to implement in terms of any/all, but it would be tremendously slow because it would have to test each bit individually. Warp any/all functions could also be implemented in terms of warp-add, or bitwise warp-and and warp-or (and therefore min/max could be implemented in terms of these as well).

Topic		Replies	Views
Fastest single-warp min() algorithm for 32 values CUDA Programming and Performance	15	3863	June 20, 2011
inter-warp min value CUDA Programming and Performance	5	2303	July 20, 2008
Finding Minimum in array CUDA Programming and Performance	6	13021	October 14, 2011
parallel way to find min CUDA Programming and Performance	21	7621	April 15, 2011
Broadcast for all threads CUDA Programming and Performance	11	10004	April 19, 2010
Reduction to find minimum value (__shfl_down) using warp shuffle CUDA Programming and Performance	3	3772	August 3, 2016
Finding minimum among multiple threads CUDA Programming and Performance	13	5531	August 11, 2013
Best way to find many minimums CUDA Programming and Performance	8	2692	January 3, 2018
Is there a block vote (analogous to warp vote?) CUDA Programming and Performance	7	20818	July 20, 2009
WARP Voting function CUDA Programming and Performance	6	6602	March 25, 2010

Most efficient blockmin function?

Related topics