I’ve got a very large matrix (15k x 15k) of doubles that I need to go through and count up elements that reach a certain threshold. Is there any way to do this directly on the GPU without having to read out a matrix of equivalent size and for-loop through it?
For example, I only expect about 40 of the elements in the matrix to meet the threshold. I would like to somehow find a way to return just the row+col address of THOSE elements. Currently I’m thinking I need to return a 15k x 15k matrix with, let’s say, a value of 1 within each element that meets the threshold and then parse through it with a for loop to find out which elements are of interest.
The speed up GPU side is obvious but, as far as I can tell, I can’t just have each thread access the same variable since it’s all async-ed. Is there any mechanism to allow threads to append to, meaningfully, the same array variable?