Your problem isn’t an easy fit for parallel computing, because calculating the average value can’t be parallelized easily. You can’t just let all threads write to the same location…
Maybe it’s best to use the CPU, or to calculate the average value on the host and do the rest on the GPU. But I guess with all the data transfer that won’t be very fast. Maybe someone can explain to you how to do a reduction to calculate the average… (I can’t, because I’ve never done that.)
I’d say that your best bet is to have one kernel that calculates the average value for each row and stores it in an array in global memory (i.e. ‘reduces’ the matrix to a column vector), then a second kernel that reads the average value for each row and subtracts it from each element in that row (i.e. subtracts that column vector from each column of the matrix and stores the result back in the matrix).
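A rough sketch of that two-kernel approach might look like the following. It is only an illustration under some assumptions that were not stated in the thread: the matrix is stored row-major as `float`, the block size is a power of two, and the number of rows fits in the grid's y dimension (at most 65535). The names `rowMeans`, `subtractRowMeans`, and `centerRows` are made up for this example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Kernel 1: one block per row. Each thread sums a strided slice of the row,
// then the block combines the partial sums with a shared-memory tree reduction.
// Assumes blockDim.x is a power of two and the matrix is row-major.
__global__ void rowMeans(const float* data, float* means, int cols)
{
    extern __shared__ float partial[];
    int row = blockIdx.x;
    int tid = threadIdx.x;

    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x)
        sum += data[row * cols + c];
    partial[tid] = sum;
    __syncthreads();

    // Halve the number of active threads each step until partial[0] holds the row sum.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        means[row] = partial[0] / cols;
}

// Kernel 2: one thread per element; subtract the mean of the element's row.
__global__ void subtractRowMeans(float* data, const float* means, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (row < rows && col < cols)
        data[row * cols + col] -= means[row];
}

// Hypothetical host helper: copies the matrix to the GPU, runs both kernels,
// and copies the centered matrix back into h_data.
void centerRows(float* h_data, int rows, int cols)
{
    float* d_data = nullptr;
    float* d_means = nullptr;
    size_t bytes = (size_t)rows * cols * sizeof(float);
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_means, rows * sizeof(float));
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;  // power of two, as the reduction requires
    rowMeans<<<rows, threads, threads * sizeof(float)>>>(d_data, d_means, cols);

    dim3 grid((cols + threads - 1) / threads, rows);
    subtractRowMeans<<<grid, threads>>>(d_data, d_means, rows, cols);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_means);
}
```

Because each block owns exactly one row, no two blocks ever write to the same position, which avoids the write-conflict problem raised earlier in the thread. The row averages stay in global memory between the two launches, so the only host/device transfers are the initial upload and the final download of the matrix.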