Hi, I am calculating the sum of all the elements in a row held in shared memory and storing the updated value in device memory. My code is as follows:
Sum_Row += AS(ty, tx); Sum_Row += BS(ty, tx); __syncthreads();
“AS” and “BS” are in shared memory.
“Sum_Row”, “Sum_Row” are device memory arrays
By copying “Sum_Row” back to host memory, I checked the results and found that my program gives correct results in EmuDebug mode but not in Debug mode. I also noticed that the value from Debug mode is the sum of the last two elements of the last row. This computation is performed on an (N,1) grid.
Is this way of adding the elements not valid in GPU programming? If it is not, could you please post the correct procedure or suggest an alternative?
I appreciate your response.