The design pattern to use here is a parallel reduction. You can find numerous examples and discussions of it on the web, and the CUDA samples include a parallel reduction sample, along with an accompanying presentation. Google “Mark Harris reduction”.

So it seems you are computing the element-wise difference of vectors ‘productCoords’ and ‘points’, then computing the norm of the resulting vector?
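If that reading is right, here is a rough sketch of how the reduction pattern could fuse the subtraction and the sum of squares into a single pass. It is only an illustration under my assumptions, not the tuned kernel from the CUDA samples: the names sqDiffPartialSums and diffNorm, the block size of 256, and the cap of 64 blocks are all made up for this example, and error checking is left out for brevity. Note that a plain sum of squares like this has no protection against overflow or underflow.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block accumulates squared differences over a grid-stride loop,
// reduces them in shared memory, and writes one partial sum.
__global__ void sqDiffPartialSums(const float* productCoords, const float* points,
                                  float* partial, int n)
{
    __shared__ float sdata[kThreads];
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float d = productCoords[i] - points[i];
        acc += d * d;
    }
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns the Euclidean norm of (a - b).
// Assumes a and b have the same length. CUDA error checks omitted.
float diffNorm(const std::vector<float>& a, const std::vector<float>& b)
{
    const int n = static_cast<int>(a.size());
    if (n == 0) return 0.0f;
    const int blocks = std::min(64, (n + kThreads - 1) / kThreads);

    float *dA, *dB, *dPartial;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sqDiffPartialSums<<<blocks, kThreads>>>(dA, dB, dPartial, n);

    // Finish the reduction on the host: only a few partial sums remain.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (float p : partial) sum += p;

    cudaFree(dA); cudaFree(dB); cudaFree(dPartial);
    return static_cast<float>(std::sqrt(sum));
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 0.0f);
    printf("norm of difference = %f\n", diffNorm(a, b));
    return 0;
}
```

Summing the per-block partial results on the host keeps the sketch short; a second kernel launch or atomicAdd would be alternatives if the result needs to stay on the device.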

If the vectors are quite long, you could apply an element-wise subtraction, then call cublasSnrm2() on the difference vector to compute its norm; cublasSnrm2() includes guards against overflow and underflow in intermediate computations. This two-pass approach may not be the fastest, however, and cublasSnrm2() is overkill if the vectors are short.
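For illustration, the two-pass variant might look roughly like the following. I am assuming here that the subtraction is done with cublasSaxpy() using alpha = -1 (a small custom kernel would work just as well), that overwriting the device copy of the first vector is acceptable, and that the default host pointer mode is in effect. The helper name diffNormCublas is made up for this sketch, and error checking is omitted. Link with -lcublas.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical host helper: norm of (productCoords - points) via cuBLAS.
// Assumes both vectors have the same length. Error checks omitted.
float diffNormCublas(const std::vector<float>& productCoords,
                     const std::vector<float>& points)
{
    const int n = static_cast<int>(productCoords.size());
    if (n == 0) return 0.0f;

    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMemcpy(dA, productCoords.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, points.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Pass 1: dA = (-1) * dB + dA, i.e. the element-wise difference,
    // stored in place over the device copy of productCoords.
    const float minusOne = -1.0f;
    cublasSaxpy(handle, n, &minusOne, dB, 1, dA, 1);

    // Pass 2: Euclidean norm of the difference vector.
    // With the default host pointer mode the result lands in host memory.
    float norm = 0.0f;
    cublasSnrm2(handle, n, dA, 1, &norm);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    return norm;
}

int main()
{
    std::vector<float> productCoords = {3.0f, 0.0f, 1.0f};
    std::vector<float> points        = {0.0f, 4.0f, 1.0f};
    printf("norm of difference = %f\n", diffNormCublas(productCoords, points));
    return 0;
}
```

The difference vector is written to memory in pass 1 and read back in pass 2, which is the extra traffic that makes this approach potentially slower than a fused kernel.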