I’m trying to write the most efficient function to find the minimum value of a variable across all threads of a block.
Does anyone see any room for improvement here? It would be nice if we had an intrinsic for this.
Thanks
unsigned int blockmin(unsigned int data)
{
__shared__ unsigned int datum[BLOCK_SIZE];
__shared__ int zeroflag;
unsigned int i;
//
// Check for zero on quick path
//
zeroflag = 0;
__syncthreads();
if (data == 0)
{
zeroflag = 1;
}
__syncthreads();
if (zeroflag)
{
return 0;
}
//
// Zero not present on any thread. Get min of all threads' values.
//
datum[threadIdx.x] = data;
__syncthreads();
for(unsigned int i=BLOCK_SIZE/2; i>0; i>>=1)
{
if (threadIdx.x < i)
{
if (datum[threadIdx.x + i] < datum[threadIdx.x])
{
datum[threadIdx.x] = datum[threadIdx.x + i];
}
}
__syncthreads();
}
return datum[0];
}
I’m thinking that even if an intrinsic for blockmin is impossible due to thread syncing issues, one for warps rather than blocks could be very useful. In fact, that would reduce a blockmin implementation down to a single value to be checked per warp rather than one value to be checked per thread. That would make a very nice speedup.
Anyhow, does anyone see anything I’m missing in speeding this up?
What you’re looking for is a common pattern in CUDA called a reduction. It can be done in parallel much more efficiently than a loop like your code above.
Check out the CUDA SDK’s reduction example… it’s extremely well documented.
I have updated the code on the original post to use reduction. But I’m still looking for a better way to do this. Is there any way to do a warp min without using shared memory?
Not quite true. The vote intrinsics _any and _all can be used to gather information from all threads in a warp. I’m just trying to figure out if there is a way to use them for a warp-wide min function.
Well, you can find out if any of the threads in the warp are smaller than the current minimum. But if you can use that to make a faster min than a standard reduction? I have the idea not. In general the more simple and stupid the code is, the faster it performs ;)
Out of curiosity, does anyone know of a simple (but practical) algorithm that benefits from the vote intrinsics? I keep thinking I understand where to use them, but then end up with something that doesn’t actually need warp voting.
Well, if we had vote min and max I think I’ve demonstrated the use case. :)
As far as vote any/all, they allow you to test if a certain condition would cause a branch divergence without actually causing a branch divergence. Any of the vote intrinsics basically allows fast interthread communication without the use of shared memory.
Right, I’ve got that part already. I’m looking for a simple algorithm where being able to construct a predicate across a warp (but not a block) helps you. I have not come up with one thinking about the sorts of problems I usually solve. Seeing a real use for warp voting might help me understand where it would be beneficial.
I don’t have any examples for you that would require vote any/all, just vote min/max. In fact, vote min/max can be used to implement vote any/all. The former are more fundamental than the latter.
I had a hard time thinking of appilcations but I think this might be one:
Suppose you an operation, where each thread takes a sequence of inputs, and depending on some predicate, may or may not produce an output. Each thread appends its output, if any, to its own array, so threads don’t step on each other. Writes will generally be uncoalesced. Perhaps outputs could be queued in shared memory, and then when __any() of the threads fills its queue, they flush the queued elements to global memory. If the output is allowed to have gaps, then these could be coalesced writes.
Somewhat contrived, I know, but I think it could provide an advantage in this case.
As for min/max, it is possible to implement in terms of any/all, but it would be tremendously slow because it would have to test each bit individually. Warp any/all functions could also be implemented in terms of warp-add, or bitwise warp-and and warp-or (and therefore min/max could be implemented in terms of these as well).