I have been trying to find a good way to calculate grid-wide aggregate variables in CUDA, but I have run into some complications. Does anyone know of a good method for computing such variables?
My attempt was to have each block of threads write its data to a global variable, and then to combine those global variables. However, I have not found a way to ensure that every thread's write actually reaches the variable. I expected things to slow down while threads waited their turn to add their value, but instead the updates from many of the threads are simply skipped.
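For reference, the per-block approach described above can be sketched roughly as follows. This is a minimal sketch and not the code from my attempt: it assumes the aggregate is a sum of floats, and it assumes the skipped updates come from unsynchronized read-modify-write on the global variable, which is a guess. The names `sum_kernel`, `sum_on_gpu`, and `BLOCK_SIZE` are hypothetical. Each block first reduces its values in shared memory, and then a single thread per block adds the block's partial sum to the global total with `atomicAdd`, so no update can be overwritten by another block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block size; the tree reduction below assumes a power of two.
constexpr int BLOCK_SIZE = 256;

// Sums in[0..n) into *total. *total must be zeroed before the launch.
__global__ void sum_kernel(const float* in, int n, float* total) {
    __shared__ float partial[BLOCK_SIZE];

    // Grid-stride loop: each thread accumulates a private running sum.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        local += in[i];
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction within the block, using shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // One atomic update per block, so concurrent blocks cannot lose writes.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Host helper (hypothetical): copies data to the device, runs the kernel,
// and returns the grid-wide sum. Error checking is omitted for brevity.
float sum_on_gpu(const float* host_in, int n) {
    float* d_in = nullptr;
    float* d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));  // all-zero bytes are 0.0f

    // Cap the grid size; the grid-stride loop covers any remaining elements.
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    if (blocks > 1024) blocks = 1024;
    if (blocks < 1) blocks = 1;
    sum_kernel<<<blocks, BLOCK_SIZE>>>(d_in, n, d_total);

    // This blocking copy also waits for the kernel to finish.
    float total = 0.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return total;
}
```

Calling `sum_on_gpu(data, n)` from host code returns the total. Doing one atomic per block rather than one per thread keeps contention on the global variable low. Note that the order in which blocks add their partial sums is not fixed, so floating-point results can differ slightly between runs.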