In the BigO term, what is the algorithmic complexity of random shape matrix multiplication by CUDA? What kind of alleviation is observed compared to native CPU code?

I assume you a referring to time complexity. Standard matrix multiplication has a complexity of O(m*n*k). There are alternate multiplication schemes of lower complexity, like those of Strassen or Coppersmith-Winograd, but to my knowledge they are not used by CUBLAS. You could use them in your own code though. If you o down that path, beware of numerical issues.

Not sure what you mean by “alleviation”. If you are referring to speedup it depends on which GPU and which CPU you are talking about, and which library. Note that hardware differences and software optimizations can cause a big difference in constants which disappear in big-O, e.g. if one program takes time 1000*n, the other 10*n, they are both of complexity O(n), although the latter is 100 times faster compared to the former.