I have just determined that one way to map one of my algorithms onto CUDA is to comptute a lot of results in a matrix, where each matrix element is computed by a thread. Each row in the matrix consists of the threads in one block. I do not actually allocate a matrix, since I can use memory local to each thread to hold this element.
This is only surprising because the usual serial form of this algorithm has no such matrix, but seems a completely normal way to use CUDA.
Now I want to know which rows (i.e. blocks) have all their matrix element equal to zero. (A slightly different formulation would be that I only need to know if there is any row where all the matrix elements are equal to zero).
Obviously, I could have actually allocated a big matrix for all these results, and then set a thread to go through each row, which means I will find out if there are any “all zero” rows pretty quickly, but it does require using more memory, and computing these elements will be faster than passing over a row. So I will give up a time factor of two or more likely worse. Allocating the big matrix will at least double the memory used in this algorithm, but if the algorithm is fast enough, possible worse (the faster this can go, problems with longer rows will be attempted; everthing else in the problem is O(1) space with respect to row length.
If there was a block vote, or if I could map blocks to warps (does one get to do that?) then warp vote, such as all(x == 0), would presumably be faster.
__shared__ int blockvote;
blockvote=0;
__syncthreads();
if (myThreadVotesYes) blockvote=1;
__syncthreads();
if (blockvote) { .... // all of block follows this if ANY thread voted yes
}
And the converse code:
__shared__ int blockvote;
blockvote=1;
__syncthreads();
if (!myThreadVotesYes) blockvote=0;
__syncthreads();
if (blockvote) { .... // all of block follows this if ALL threads voted yes
}
Perhaps an easier approach would be to warp voting within each warp in your block and then do a small reduction at the end of the block to combine the votes from the different warps.
I think I understand why these work. When I get back to where I can run CUDA stuff I will try this approach.
If I wanted to know how many blocks voted yes would this work:
__shared__ int blockcount;
blockcount=0;
__syncthreads();
if (myThreadVotesYes) blockcount++;
__syncthreads();
/* blockcount is the number of threads that voted yes */
Then to keep track of which blocks voted yes without an array as big as all the blocks (very few might vote yes)
can I get away with this sort of thing:
__global__ yesBlockList[LOTS];
__shared__ int blockcount;
blockcount=0;
__syncthreads();
if (myThreadVotesYes) {
blockcount++;
yesBlockList[blockcount] = Block.Idx;
}
__syncthreads();
As it turns out, I will stick to the simpler form that uses only block vote; I have finished profiling a scalar C version of my problem written in a form that mimics
the CUDA form. According to the profiler, more than 99.99% of the time is spent in the single simple function that I have been referring to here. At the moment the problem fits into about 200KBytes so I am hoping this will be a “best possible” case for CUDA. (Additionally, this form uses only integer arithmetic. There is a double precision version which saves about half the arithmetic, but since it has to be double precision, I doubt that will lead to the best CUDA version).
I thought I had a pretty reasonable Matlab prototype for this problem but so far the hand buffed scalar C code is already something like 100 times faster. Nice though that is, this problem appears to be painfully exponential, so even with a big boost from CUDA we might still be looking at a huge runtime (days - years?).