Implementing Counter within GPU

I’ve got a very large matrix (15k x 15k) of doubles that I need to go through and count up elements that reach a certain threshold. Is there any way to do this directly on the GPU without having to read out a matrix of equivalent size and for-loop through it?

For example, I only expect about 40 of the elements in the matrix to meet the threshold. I would like to find a way to return just the row+col addresses of THOSE elements. Currently I’m thinking I’d have to return a 15k x 15k matrix with, say, a 1 in each element that meets the threshold, and then loop through it on the CPU to find which elements are of interest.

The speed-up on the GPU side is obvious, but as far as I can tell I can’t just have every thread write to the same variable, since they all run asynchronously. Is there any mechanism that lets threads meaningfully append to the same array?

You do need to examine every element. This isn’t hard, though. Your speed will be completely memory-bandwidth limited, assuming the data is already on the GPU. (If the data is on the host, this is a case where you should just use the CPU, since a PCIe transfer speed of ~5 GB/s would be slower than doing the computation on the CPU in the first place.)

The real question you’re asking is how to report results. There are many options, and the general solution is stream compaction, which returns exactly the sparse set of elements you select.
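If you go the compaction route, Thrust makes it nearly a one-liner. Here’s a minimal sketch, assuming the matrix is flattened into a device_vector; the variable names, threshold value, and toy data below are all placeholders, not anything from your code:

#include <cstdio>
#include <thrust/copy.h>
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: true for elements above the threshold.
struct AboveThreshold {
    double t;
    __host__ __device__ bool operator()(double x) const { return x > t; }
};

int main() {
    const int n = 15000 * 15000;      // 225M doubles = ~1.8 GB of device memory
    const double threshold = 0.5;     // placeholder threshold
    thrust::device_vector<double> d_matrix(n, 0.0);  // stands in for your data
    d_matrix[12345] = 1.0;                           // one toy hit for the demo

    // Count the matches first so the output buffer can be sized exactly.
    int matches = thrust::count_if(d_matrix.begin(), d_matrix.end(),
                                   AboveThreshold{threshold});
    thrust::device_vector<int> d_indices(matches);

    // copy_if with a stencil: keep flat index i wherever d_matrix[i] passes.
    thrust::copy_if(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(n),
                    d_matrix.begin(),
                    d_indices.begin(),
                    AboveThreshold{threshold});

    printf("found %d elements above threshold\n", matches);
    return 0;
}

From a flat index i you can recover row = i / 15000 and col = i % 15000.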

But since you specifically say you won’t be returning many values, you can use the simpler method of atomics, which won’t cause a performance problem as long as the returned values are sparse (say, fewer than 10% of the elements).

Pass in a pointer to a results array and a pointer to a counter value.

Your core loop will look something like this:

int tid = threadIdx.x + blockIdx.x * blockDim.x;

// Grid-stride loop: each thread strides through the array.
while (tid < arraySize) {
    if (array[tid] > threshold) {
        // Atomically reserve the next free slot in the results array.
        int reportindex = atomicAdd(counterptr, 1);
        results[reportindex] = tid;   // record the flat index of the hit
    }
    tid += gridDim.x * blockDim.x;
}

After this is done, the counter holds how many values you found, and the results array holds the flat indices of those values.
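For concreteness, here’s a host-side sketch of the whole thing; the kernel name, launch configuration, and maxHits cap are my own placeholders, and error checking is omitted. The one essential detail is zeroing the counter before the launch:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel wrapping the loop above.
__global__ void findAboveThreshold(const double *array, int arraySize,
                                   double threshold,
                                   int *results, int *counterptr)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < arraySize) {
        if (array[tid] > threshold) {
            int reportindex = atomicAdd(counterptr, 1);
            results[reportindex] = tid;
        }
        tid += gridDim.x * blockDim.x;
    }
}

int main() {
    const int n = 15000 * 15000;
    const int maxHits = 1024;   // generous cap; only ~40 hits are expected
    double *d_array; int *d_results, *d_counter;
    cudaMalloc(&d_array, n * sizeof(double));
    cudaMemset(d_array, 0, n * sizeof(double));  // stands in for your real data
    cudaMalloc(&d_results, maxHits * sizeof(int));
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));       // counter must start at zero

    findAboveThreshold<<<256, 256>>>(d_array, n, 0.5, d_results, d_counter);

    int h_count;
    cudaMemcpy(&h_count, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("found %d elements above threshold\n", h_count);
    return 0;
}

In production code you’d also guard against reportindex running past the capacity of the results buffer (e.g., pass maxHits into the kernel and skip the write when the buffer is full).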

You could fancy this up for 2D array indexing, reporting the value as well as the index, etc.
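For example, here’s a sketch of the 2D variant, using CUDA’s built-in int2 type to return (row, col) pairs; the kernel and parameter names are again my own placeholders:

__global__ void findAboveThreshold2D(const double *array, int numRows,
                                     int numCols, double threshold,
                                     int2 *results, int *counterptr)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int arraySize = numRows * numCols;
    while (tid < arraySize) {
        if (array[tid] > threshold) {
            int reportindex = atomicAdd(counterptr, 1);
            // Convert the flat index to a (row, col) address.
            results[reportindex] = make_int2(tid / numCols,
                                             tid % numCols);
        }
        tid += gridDim.x * blockDim.x;
    }
}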

You may also use shared memory to stage results for each multiprocessor, and then use a global atomic operation only to allocate output buffer space when a block’s staging buffer fills; that way you avoid generating 22.5 million global atomic operations :-)

22.5 million atomic operations (10% of the 15k x 15k = 225 million elements, the worst case under the “< 10%” guideline above), at roughly 400 SEQUENTIAL cycles each, comes to about 9 billion GPU cycles, so you could end up spending more than 4 seconds on atomic operations alone!
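Here’s a minimal sketch of that staging scheme, under my own assumptions (at most MAX_PER_BLOCK hits land in any one block; all names are illustrative): each block collects hits with cheap shared-memory atomics, then a single global atomicAdd per block reserves space in the output array.

#define MAX_PER_BLOCK 64

__global__ void findStaged(const double *array, int arraySize,
                           double threshold, int *results, int *counterptr)
{
    __shared__ int s_hits[MAX_PER_BLOCK];
    __shared__ int s_count;
    __shared__ int s_base;

    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < arraySize; tid += gridDim.x * blockDim.x) {
        if (array[tid] > threshold) {
            int slot = atomicAdd(&s_count, 1);   // shared-memory atomic: cheap
            if (slot < MAX_PER_BLOCK) s_hits[slot] = tid;
        }
    }
    __syncthreads();

    // One global atomic per block reserves space in the output array.
    if (threadIdx.x == 0) {
        s_count = min(s_count, MAX_PER_BLOCK);
        s_base = atomicAdd(counterptr, s_count);
    }
    __syncthreads();

    // Cooperatively flush this block's staged hits to global memory.
    for (int i = threadIdx.x; i < s_count; i += blockDim.x)
        results[s_base + i] = s_hits[i];
}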

The threshold will only be hit by about 40 values, though, so the overhead of 40 atomicAdd calls is negligible.