Parallel implementation of GMM


I am designing a parallel implementation of Gaussian Mixture Models in CUDA. The core functionality is computing the Mahalanobis distance between each input vector (x), D-dimensional, to each gaussian component, characterised by mean vector and Sigma matrix, both D-dimensional.
The function is

Typically, it requires one vector-matrix product and one vector-vector dot-product.
In a setting with K inputs and 1 gaussian, I can arrange the K input vectors in a KxD matrix and then perform matrix-matrix multiplication with Sigma, and then K vector-vector dot products. The result will be a K-dimensional vector.

For N components, the result should be a KxN matrix of distances, I can stack the N Sigma matrices in a row and then, by multiplying with the KxD input matrix, obtain a KxND matrix. Each row of this matrix can be reshaped as a NxD matrix and then perform a vector-matrix product to obtain each of the K N-dimensional vectors.

Does this last case seem to be the most optimal way of doing it? I think that using CUBLAS would perform better than implementing it myself, but maybe I am missing some point, as there are two operations involved (vector-matrix-vector = scalar).
Also, the common values for the matrix dimensions are, approximately, d=14-18, K=10^2-10^5, N=10-10^4

Thanks for your time,


@yengamatic I also try to implement the same computation, can you share your result if you have done it?