Point-wise mean: best approach?

I have several arrays of size N. I will be computing the point-wise average (mean) of a predetermined number of these, M. So I have M arrays of size N that I need to average together to produce one array of size N.

I know that division is considered costly to perform on the GPU. I was thinking that the best approach may be to multiply each element in the M arrays by (1/M), and then do simple additions to compute the averages. Would I be better off doing this 1/M multiplication on the CPU before moving the arrays to the GPU for averaging? Does a function exist for the GPU to perform division by a scalar optimally?
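For concreteness, here is roughly what I had in mind (the kernel name and the packed M*N layout are just for illustration): the host computes 1.0f/M once and passes it in, so each thread does M-1 additions and a single multiply, and no division ever runs on the device.

```cuda
// Sketch: average M arrays of length N, assuming they are packed
// contiguously in one device buffer (array m starts at offset m*N).
// invM = 1.0f / M is computed once on the host, so the kernel never divides.
__global__ void pointwiseMean(const float *data, float *out,
                              int N, int M, float invM)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float sum = 0.0f;
        for (int m = 0; m < M; ++m)
            sum += data[m * N + i];
        out[i] = sum * invM;   // one multiply per element, no division
    }
}

// Host side (illustrative launch):
//   int threads = 256;
//   int blocks  = (N + threads - 1) / threads;
//   pointwiseMean<<<blocks, threads>>>(d_data, d_out, N, M, 1.0f / M);
```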

It kind of depends on what you call costly. IMHO your algorithm will be memory-bandwidth bound anyway, in which case you shouldn't see the division affect performance at all. When doing sums over a large M you should watch your precision, though.
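For example, one cheap way to keep round-off under control when M gets large is to widen the accumulator (Kahan summation would be another option). Just a sketch, reusing the hypothetical layout from the kernel above:

```cuda
// Same idea, but with a wider accumulator to limit round-off when M is
// large (Kahan summation would be an alternative). Same assumed layout
// as the sketch above.
__global__ void pointwiseMeanAccum(const float *data, float *out,
                                   int N, int M, float invM)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        double sum = 0.0;                      // double accumulator
        for (int m = 0; m < M; ++m)
            sum += (double)data[m * N + i];
        out[i] = (float)(sum * invM);          // invM promoted to double
    }
}
```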

Note that it is only integer division that is relatively slow; floating-point division is fast.
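And to the question about a dedicated function: CUDA does expose a fast approximate single-precision divide intrinsic, __fdividef, if you really want to divide on the device; precomputing 1.0f/M on the host is still the simplest route, though. A rough sketch (the helper name is made up):

```cuda
// Sketch only: a device helper using CUDA's fast approximate
// single-precision divide intrinsic __fdividef.
__device__ float scaleByCount(float sum, int M)
{
    return __fdividef(sum, (float)M);   // approximate sum / M
}
```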