Why is numpy matrix multiplication faster than CUDA?

I’ve seen this question asked for other computations, and the answer was usually “that computation is a bad example for GPU acceleration.” However, I thought matrix multiplication was something of a gold standard for showing the benefits of GPU acceleration. Needless to say, I didn’t expect NumPy’s matrix multiplication on my CPU to be about 6x faster than my parallel CUDA implementation.
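
For concreteness, here is a minimal sketch of the kind of comparison I mean (not my exact code; I’m assuming a naive one-thread-per-output-element kernel written with Numba, square float32 matrices, and n = 2048 just for illustration):

```python
# Sketch only: NumPy on the CPU vs. a naive hand-written CUDA kernel via Numba.
import time
import numpy as np
from numba import cuda

@cuda.jit
def matmul_naive(A, B, C):
    # One thread computes one element of C; no shared-memory tiling.
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

n = 2048  # assumed size for illustration
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

# CPU: NumPy dispatches to an optimized BLAS (e.g. OpenBLAS or MKL).
t0 = time.perf_counter()
C_cpu = A @ B
t_cpu = time.perf_counter() - t0

# GPU: copy inputs to the device, launch the kernel, then copy the result back.
dA, dB = cuda.to_device(A), cuda.to_device(B)
dC = cuda.device_array((n, n), dtype=np.float32)
threads = (16, 16)
blocks = ((n + threads[0] - 1) // threads[0],
          (n + threads[1] - 1) // threads[1])
t0 = time.perf_counter()
matmul_naive[blocks, threads](dA, dB, dC)  # first launch also includes JIT compilation
cuda.synchronize()
t_gpu = time.perf_counter() - t0
C_gpu = dC.copy_to_host()

print(f"NumPy (CPU): {t_cpu:.3f} s, naive CUDA kernel: {t_gpu:.3f} s")
```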

Is there an obvious explanation for why this is?