Pairwise distance in CUDA

Are there any recent GPU implementations (e.g., in CUDA or OpenCL) of pairwise distance calculation, either standalone or as part of a library?

The only thing I could find is this 2008 paper, which presents two CUDA implementations:
http://www.gpucomputing.net/sites/default/files/papers/904/Chang_etal_CBB2008_634-017.pdf

In fact, I need not only the pairwise distances themselves but also their sum. I could get that by running a parallel reduction over the distances once they have been computed.
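
To make the idea concrete, here is a rough sketch of the approach described above, not taken from the paper. It assumes Euclidean distance, points stored row-major as n x dim floats, and that each unordered pair (i < j) is counted once. All names (pairwiseDistancePartialSums, sumPairwiseDistances, BLOCK) are made up for illustration. Instead of storing the full distance matrix, each thread accumulates its distances locally, each block does a shared-memory reduction, and the host adds up the per-block results.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block; must be a power of two for the reduction

// Each thread walks the n*n index space with a grid-stride loop, keeps only
// pairs with i < j, and accumulates their Euclidean distances. The block then
// reduces its threads' partial sums in shared memory and writes one value.
__global__ void pairwiseDistancePartialSums(const float* pts, int n, int dim,
                                            double* blockSums)
{
    __shared__ double partial[BLOCK];

    double local = 0.0;
    const long long total  = (long long)n * n;
    const long long stride = (long long)gridDim.x * blockDim.x;
    for (long long k = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         k < total; k += stride) {
        const int i = (int)(k / n);
        const int j = (int)(k % n);
        if (i < j) {
            double d2 = 0.0;
            for (int c = 0; c < dim; ++c) {
                const double diff = (double)pts[i * dim + c] - (double)pts[j * dim + c];
                d2 += diff * diff;
            }
            local += sqrt(d2);
        }
    }

    // Tree reduction within the block.
    partial[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = partial[0];
}

// Host wrapper: copies the points over, launches the kernel, and finishes the
// reduction on the CPU. Error checking is omitted to keep the sketch short.
double sumPairwiseDistances(const std::vector<float>& pts, int n, int dim)
{
    const long long total = (long long)n * n;
    const long long want  = (total + BLOCK - 1) / BLOCK;
    const int blocks = (int)(want < 1 ? 1 : (want > 1024 ? 1024 : want));

    float*  dPts  = nullptr;
    double* dSums = nullptr;
    cudaMalloc(&dPts,  pts.size() * sizeof(float));
    cudaMalloc(&dSums, blocks * sizeof(double));
    cudaMemcpy(dPts, pts.data(), pts.size() * sizeof(float), cudaMemcpyHostToDevice);

    pairwiseDistancePartialSums<<<blocks, BLOCK>>>(dPts, n, dim, dSums);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<double> sums(blocks);
    cudaMemcpy(sums.data(), dSums, blocks * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dPts);
    cudaFree(dSums);

    double result = 0.0;
    for (double s : sums) result += s;
    return result;
}

int main()
{
    // Four corners of a 3 x 4 rectangle: sides 3, 3, 4, 4 and two diagonals of 5.
    const std::vector<float> pts = {0.f, 0.f,  3.f, 0.f,  0.f, 4.f,  3.f, 4.f};
    std::printf("sum = %f\n", sumPairwiseDistances(pts, 4, 2));
    return 0;
}
```

For the four rectangle corners in main, the six pairwise distances are 3, 4, 5, 5, 4 and 3, so the sum should come out as 24. This brute-force version does O(n^2 * dim) work and reads every point straight from global memory, so I would expect a tiled or library-based version to be considerably faster for large n.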

Thanks in advance :P