Why is numpy matrix multiplication faster than CUDA?

https://stackoverflow.com/questions/66247190/why-is-numpy-matrix-multiplication-faster-than-cuda

I’ve seen this question asked about other computations, and the answer was usually “that computation is a bad example for GPU acceleration.” However, I thought matrix multiplication was something of a gold standard for the benefits of GPU acceleration, so I didn’t expect matrix multiplication on my CPU with NumPy to be 6x faster than my parallel CUDA algorithm.

Is there an obvious explanation for this?