Is there a block vote (analogous to warp vote?)

zenpharaohs · July 19, 2009, 5:17am

I have just determined that one way to map one of my algorithms onto CUDA is to comptute a lot of results in a matrix, where each matrix element is computed by a thread. Each row in the matrix consists of the threads in one block. I do not actually allocate a matrix, since I can use memory local to each thread to hold this element.

This is only surprising because the usual serial form of this algorithm has no such matrix, but seems a completely normal way to use CUDA.

Now I want to know which rows (i.e. blocks) have all their matrix element equal to zero. (A slightly different formulation would be that I only need to know if there is any row where all the matrix elements are equal to zero).

Obviously, I could have actually allocated a big matrix for all these results, and then set a thread to go through each row, which means I will find out if there are any “all zero” rows pretty quickly, but it does require using more memory, and computing these elements will be faster than passing over a row. So I will give up a time factor of two or more likely worse. Allocating the big matrix will at least double the memory used in this algorithm, but if the algorithm is fast enough, possible worse (the faster this can go, problems with longer rows will be attempted; everthing else in the problem is O(1) space with respect to row length.

If there was a block vote, or if I could map blocks to warps (does one get to do that?) then warp vote, such as all(x == 0), would presumably be faster.

E.D_Riedijk · July 19, 2009, 6:12am

If you make blocks of 32 threads, a warp vote will be a block vote :)

SPWorley · July 19, 2009, 7:45am

__shared__ int blockvote;

blockvote=0;

__syncthreads();

if (myThreadVotesYes) blockvote=1;

__syncthreads();

if (blockvote) { .... // all of block follows this if ANY thread voted yes

  }

And the converse code:

__shared__ int blockvote;

blockvote=1;

__syncthreads();

if (!myThreadVotesYes) blockvote=0;

__syncthreads();

if (blockvote) { .... // all of block follows this if ALL threads voted yes

}

StickGuy · July 19, 2009, 8:22am

Perhaps an easier approach would be to warp voting within each warp in your block and then do a small reduction at the end of the block to combine the votes from the different warps.

zenpharaohs · July 19, 2009, 1:06pm

Good to know, but my block size will be less than eight, unless I re-map the problem, or else nVidia gets some serious hardware cooking.

I can remap the algorithm to put more than one “row” in a block, but they won’t be an even divisor of 32.

zenpharaohs · July 19, 2009, 5:20pm

__shared__ int blockvote;

blockvote=0;

__syncthreads();

if (myThreadVotesYes) blockvote=1;

__syncthreads();

if (blockvote) { .... // all of block follows this if ANY thread voted yes

  }

And the converse code:

__shared__ int blockvote;

blockvote=1;

__syncthreads();

if (!myThreadVotesYes) blockvote=0;

__syncthreads();

if (blockvote) { .... // all of block follows this if ALL threads voted yes

}

I think I understand why these work. When I get back to where I can run CUDA stuff I will try this approach.

If I wanted to know how many blocks voted yes would this work:

__shared__ int blockcount;

blockcount=0;

__syncthreads();

if (myThreadVotesYes) blockcount++;

__syncthreads();

/* blockcount is the number of threads that voted yes */

Then to keep track of which blocks voted yes without an array as big as all the blocks (very few might vote yes)

can I get away with this sort of thing:

__global__ yesBlockList[LOTS];

__shared__ int blockcount;

blockcount=0;

__syncthreads();

if (myThreadVotesYes) {

		blockcount++;

		yesBlockList[blockcount] = Block.Idx;

}

__syncthreads();

Or will I run into order of execution issues?

SPWorley · July 19, 2009, 6:20pm

If I wanted to know how many blocks voted yes would this work:

__shared__ int blockcount;

blockcount=0;

__syncthreads();

if (myThreadVotesYes) blockcount++;

__syncthreads();

/* blockcount is the number of threads that voted yes */

No, this will fail painfully due to thread races.

The solution for this is a block-wide REDUCTION pass. It’s fast, only log2(threadcount).

See the SDK reduction example.

You COULD use shared atomics, too, but there’s rarely any reason too, it’s almost always more efficient to use reduction.

zenpharaohs · July 20, 2009, 10:30pm

As it turns out, I will stick to the simpler form that uses only block vote; I have finished profiling a scalar C version of my problem written in a form that mimics

the CUDA form. According to the profiler, more than 99.99% of the time is spent in the single simple function that I have been referring to here. At the moment the problem fits into about 200KBytes so I am hoping this will be a “best possible” case for CUDA. (Additionally, this form uses only integer arithmetic. There is a double precision version which saves about half the arithmetic, but since it has to be double precision, I doubt that will lead to the best CUDA version).

I thought I had a pretty reasonable Matlab prototype for this problem but so far the hand buffed scalar C code is already something like 100 times faster. Nice though that is, this problem appears to be painfully exponential, so even with a big boost from CUDA we might still be looking at a huge runtime (days - years?).

Topic		Replies	Views
WARP Voting function CUDA Programming and Performance	6	6492	March 25, 2010
warp parallelism How to decreasing the size of a parallel element CUDA Programming and Performance	3	2036	March 17, 2009
Most efficient blockmin function? CUDA Programming and Performance	12	4916	April 6, 2009
A (not so) hypothetical question CUDA Programming and Performance	6	1640	March 24, 2009
How to implement a recursive combination finder in CUDA or OpenCL for large datasets? CUDA Programming and Performance cuda , kernel	7	44	January 17, 2025
Efficient use of shared memory CUDA Programming and Performance	29	4465	December 2, 2019
Any ideas on GF(2) sparse matrix-vector multiply? CUDA Programming and Performance	5	5830	August 7, 2008
Array Comparision CUDA Programming and Performance	4	4241	May 31, 2009
Here are my timing results, not impressive. Help. CUDA Programming and Performance	5	7010	January 30, 2008
Iteration help in CUDA CUDA Programming and Performance	11	6887	April 19, 2012

Is there a block vote (analogous to warp vote?)

Related topics