I have just determined that one way to map one of my algorithms onto CUDA is to comptute a lot of results in a matrix, where each matrix element is computed by a thread. Each row in the matrix consists of the threads in one block. I do not actually allocate a matrix, since I can use memory local to each thread to hold this element.

This is only surprising because the usual serial form of this algorithm has no such matrix, but seems a completely normal way to use CUDA.

Now I want to know which rows (i.e. blocks) have all their matrix element equal to zero. (A slightly different formulation would be that I only need to know if there is any row where all the matrix elements are equal to zero).

Obviously, I could have actually allocated a big matrix for all these results, and then set a thread to go through each row, which means I will find out if there are any “all zero” rows pretty quickly, but it does require using more memory, and computing these elements will be faster than passing over a row. So I will give up a time factor of two or more likely worse. Allocating the big matrix will at least double the memory used in this algorithm, but if the algorithm is fast enough, possible worse (the faster this can go, problems with longer rows will be attempted; everthing else in the problem is O(1) space with respect to row length.

If there was a block vote, or if I could map blocks to warps (does one get to do that?) then warp vote, such as all(x == 0), would presumably be faster.