I am working on an MCMC simulation that requires a very large number (millions) of dense matrix-vector multiplications. The matrix and vector are also quite large: the matrix is 25,000 x 6,000 and the vector has 6,000 elements.
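To put numbers on the problem size, here is a rough back-of-the-envelope sketch in Python. The helper name `gemv_cost` is made up for illustration, and the element sizes (8 bytes for double precision, 4 for single) are assumptions, since I have not settled on a precision.

```python
def gemv_cost(rows, cols, itemsize):
    """Return (matrix_bytes, flops) for one dense matrix-vector product.

    Hypothetical helper for illustration only.
    """
    matrix_bytes = rows * cols * itemsize   # storage for the dense matrix
    flops = 2 * rows * cols                 # one multiply and one add per matrix element
    return matrix_bytes, flops


rows, cols = 25_000, 6_000

# Assumption: 8-byte doubles or 4-byte floats; the actual precision is not fixed yet.
bytes_f64, flops = gemv_cost(rows, cols, 8)
bytes_f32, _ = gemv_cost(rows, cols, 4)

print(f"matrix size (float64): {bytes_f64 / 1e9:.1f} GB")
print(f"matrix size (float32): {bytes_f32 / 1e9:.1f} GB")
print(f"flops per product:     {flops / 1e6:.0f} MFLOP")
```

Each product reads every matrix element once and does only two floating-point operations with it, so I would expect the run time to depend mostly on memory bandwidth rather than on raw compute.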
I am now wondering which algorithm or library has the fastest implementation of this multiplication (I am hoping not to have to implement it myself). I saw a paper suggesting that cuBLAS may not be the fastest option. Is that outdated?
Any hints are appreciated.