https://stackoverflow.com/questions/66247190/why-is-numpy-matrix-multiplication-faster-than-cuda
I’ve seen this question asked about other computations, and the answer was usually “that computation is a poor fit for GPU acceleration.” However, I thought matrix multiplication was something of a gold standard for the benefits of GPU acceleration. So I didn’t expect matrix multiplication on my CPU to be 6x faster than my parallel CUDA algorithm.
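The question doesn't include the benchmark code or the matrix size, so the following is only a hypothetical sketch of how the CPU (NumPy) side of such a comparison might be timed. It uses a warm-up call, best-of-N timing, and the common 2n³ flop estimate for an n×n matrix product. The helper names `best_time` and `matmul_gflops`, and the size `n = 1024`, are illustrative assumptions, not anything from the original post.

```python
import time

import numpy as np


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn().

    One extra warm-up call is made first and is not timed.
    """
    fn()  # warm-up: excluded from the measurement
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)


def matmul_gflops(n, seconds):
    """Estimated throughput of an n x n by n x n product, counting 2*n**3 flops."""
    return 2 * n**3 / seconds / 1e9


if __name__ == "__main__":
    n = 1024  # hypothetical size; the question does not state one
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    t = best_time(lambda: a @ b)
    print(f"NumPy {n}x{n} float32 matmul: {t * 1e3:.2f} ms "
          f"(~{matmul_gflops(n, t):.1f} GFLOP/s)")
```

A GFLOP/s figure for the CUDA version is only comparable if it is measured the same way. CUDA kernel launches are asynchronous, so the GPU timing has to wait for the kernel to finish (for example with `cudaDeviceSynchronize` or CUDA events). It should also state whether host-to-device and device-to-host copies are included.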
Is there an obvious explanation for why this is?