cuBLAS snrm2 performance improvement?

One step of a computation I perform is taking the (Frobenius) norm of a 2D float32 matrix. I have implemented it several ways, including with cuBLAS snrm2. It is orders of magnitude slower than every other step (including a 3D FFT of a much larger matrix) and therefore dominates the whole computation, which would otherwise be about two orders of magnitude faster.
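For concreteness, here is a minimal sketch of the kind of call I am making, using the legacy cuBLAS API that ships with CUDA 3.0; the matrix size and variable names are placeholders, not my real data:

```c
/* Minimal sketch: Frobenius norm of a 2D float32 matrix via the legacy
   cuBLAS snrm2. Sizes and data are illustrative placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include "cublas.h"

int main(void)
{
    const int rows = 1024, cols = 1024;   /* example size only */
    const int n = rows * cols;
    float *h_A, *d_A;
    float norm;

    h_A = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        h_A[i] = 1.0f;                    /* dummy data */

    cublasInit();
    cublasAlloc(n, sizeof(float), (void **)&d_A);
    cublasSetVector(n, sizeof(float), h_A, 1, d_A, 1);

    /* Frobenius norm of the matrix == 2-norm of the flattened array */
    norm = cublasSnrm2(n, d_A, 1);
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSnrm2 failed\n");

    printf("||A||_F = %f\n", norm);

    cublasFree(d_A);
    cublasShutdown();
    free(h_A);
    return 0;
}
```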

This is using CUDA 3.0 on Mac OS and Linux, with different graphics cards.

I can understand how a reduction operation can be slow, but I think I must be doing something wrong. Does anyone have hints on what to look for? Is this just a quirk of the implementation with known workarounds, or is a norm fundamentally a bad fit for the GPU?
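In case the measurement itself is suspect, the timing pattern I use looks roughly like this (a sketch using CUDA events; `d_A` and `n` are the device pointer and element count from the snippet above):

```c
/* Sketch: timing the snrm2 call with CUDA events.
   `d_A` and `n` come from the snippet above. */
#include <stdio.h>
#include <cuda_runtime.h>
#include "cublas.h"

void time_snrm2(const float *d_A, int n)
{
    cudaEvent_t start, stop;
    float ms, norm;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    norm = cublasSnrm2(n, d_A, 1);  /* blocks: the result is returned to the host */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    printf("snrm2 = %f in %.3f ms\n", norm, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```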