In my algorithm I’m making several calls to cublasStrmm (triangular matrix multiply), all something like this:
cublasStrmm('r', 'u', 'n', 'n', 10000, 65, 1.0, A, 65, B, 10000);
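So we have m=10000 and n=65. Since A is applied from the right, it is the 65x65 upper-triangular matrix, so effectively k=65 as well.
I do 72 of these calls, and the total time taken is 645 ms, so that’s about 9 ms per call.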
This seems a bit slow, even for the funny size matrices I’m using.
By my calculations, the number of fp operations per call is

(65 * (65+1) / 2) * 2 * 10000 = 42900000

so the FLOPS achieved is 42900000 / 0.009 = 4766666667 (about 4.77 GFLOPS).
I was expecting something much higher than this. Is anyone else seeing similar performance? Or have I got my calculations wrong? :lol:
Thanks,
Alex