Pairwise distance in CUDA

Are there any recent GPU implementations (e.g., in CUDA or OpenCL) of pairwise distance calculation, either standalone or as part of a library?

The only thing I could find was this paper from 2008, which shows two implementations in CUDA.

In fact, I need not only the pairwise distances but also their sum. I could obtain the sum with a parallel reduction after calculating the pairwise distances.
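To make the goal concrete, here is a rough sketch of the kind of thing I have in mind. It is not taken from the paper. It assumes Euclidean distance, points stored row-major as n x dim floats, and a fixed block size of 256. The names pairwiseDistanceSumKernel and pairwiseDistanceSum are my own placeholders. Instead of storing the full distance matrix and reducing it afterwards, this version fuses the two steps: each thread accumulates the distances for its pairs, each block reduces in shared memory, and one atomicAdd per block updates the total. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Assumed block size; the shared-memory reduction below needs a power of two.
constexpr int kBlockSize = 256;

// Sums the Euclidean distance over all unordered pairs (i, j) with i < j.
// points: n x dim floats in row-major order, total: single float set to 0.
__global__ void pairwiseDistanceSumKernel(const float* points, int n, int dim,
                                          float* total)
{
    __shared__ float partial[kBlockSize];

    float local = 0.0f;
    const long long numCells = (long long)n * n;
    const long long stride = (long long)gridDim.x * blockDim.x;

    // Grid-stride loop over the cells of the n x n matrix; only the strict
    // upper triangle contributes, so each pair is counted once.
    for (long long k = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         k < numCells; k += stride) {
        const int i = (int)(k / n);
        const int j = (int)(k % n);
        if (i < j) {
            float sq = 0.0f;
            for (int d = 0; d < dim; ++d) {
                const float diff = points[i * dim + d] - points[j * dim + d];
                sq += diff * diff;
            }
            local += sqrtf(sq);
        }
    }

    // Tree reduction of the per-thread partial sums within the block.
    partial[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x] += partial[threadIdx.x + s];
        }
        __syncthreads();
    }

    // One atomic update per block combines the block results.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Host wrapper: copies the points to the device, launches the kernel,
// and returns the sum. Return codes of the CUDA calls are not checked here.
float pairwiseDistanceSum(const float* hostPoints, int n, int dim)
{
    float* dPoints = nullptr;
    float* dTotal = nullptr;
    const size_t bytes = (size_t)n * dim * sizeof(float);

    cudaMalloc(&dPoints, bytes);
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dPoints, hostPoints, bytes, cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));

    // Cap the grid size; the grid-stride loop covers any remaining cells.
    long long blocks = ((long long)n * n + kBlockSize - 1) / kBlockSize;
    if (blocks < 1) blocks = 1;
    if (blocks > 1024) blocks = 1024;

    pairwiseDistanceSumKernel<<<(int)blocks, kBlockSize>>>(dPoints, n, dim, dTotal);

    // The blocking copy on the default stream waits for the kernel to finish.
    float total = 0.0f;
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dPoints);
    cudaFree(dTotal);
    return total;
}
```

Calling pairwiseDistanceSum(points, n, dim) from host code returns the sum over all unordered pairs. This naive version does O(n^2 * dim) work with no tiling of the points into shared memory, and it accumulates in single precision, so I would expect a tuned implementation to be both faster and more accurate for large n. That is why I am looking for an existing one.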

