I am running some matrix multiplication tests with CUDA 3.2 RC on a GeForce GTX 480, and some of the results are really surprising. All of the test code is based on CUBLAS and computes M x N, where both matrices are 4503x4503. M is a dense matrix and N is a block diagonal matrix of the form:

[N11 0   0  ]

[0   N22 0  ]

[0   0   N33]

where N11, N22, and N33 are 1501x1501 matrices.

- sgemm is not faster than dgemm: sgemm takes ~0.000118 seconds, while dgemm takes ~0.000098 seconds. I do not understand this at all; I expected dgemm to be slower than sgemm.
- cublasFree() takes much longer for double precision than for single precision. I also found that the first call to cublasFree() takes far longer than later calls: about 3 seconds for double precision and 0.6 seconds for single precision. Subsequent calls take about 0.0025 seconds for double precision and 0.0012 seconds for single precision. Could someone explain why releasing memory for the first time is so time-consuming, and why releasing double-precision memory is so much slower?
- I expected that computing M x N could be made faster by exploiting the block diagonal structure of N, so my implementation of M x N skips the products with the zero blocks (M11 x N12, M11 x N13, etc.). However, simply calling sgemm or dgemm on the full matrices turns out to be about 5-6 times faster than my own implementation. In other words, three sequential matrix multiplications of size 1501x1501 are much slower than one matrix multiplication of size 4503x4503. This may be explainable, but it is still not what I expected.
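To make the block diagonal idea in the last point concrete, here is a minimal CPU-side sketch in plain C++. It is not my CUBLAS code: the `Matrix` type and the function names are hypothetical and only illustrate the structure. It assumes N consists of equal-sized square diagonal blocks and that M is processed one column block at a time, so each diagonal block N_jj only touches the matching columns of M and of the result. The two helper functions count multiply-add FLOPs for the full and the structured product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix stored in a flat vector.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Reference: full product C = M x N, ignoring any structure in N.
Matrix multiply_full(const Matrix& M, const Matrix& N) {
    assert(M.cols == N.rows);
    Matrix C(M.rows, N.cols);
    for (std::size_t i = 0; i < M.rows; ++i)
        for (std::size_t k = 0; k < M.cols; ++k)
            for (std::size_t j = 0; j < N.cols; ++j)
                C.at(i, j) += M.at(i, k) * N.at(k, j);
    return C;
}

// Structured product: N is assumed to be block diagonal with square
// blocks of size b. Column block j of C depends only on column block j
// of M and on the diagonal block N_jj, so the zero blocks are skipped.
Matrix multiply_block_diagonal(const Matrix& M, const Matrix& N, std::size_t b) {
    assert(M.cols == N.rows && N.rows == N.cols && N.cols % b == 0);
    Matrix C(M.rows, N.cols);
    const std::size_t blocks = N.cols / b;
    for (std::size_t blk = 0; blk < blocks; ++blk) {
        const std::size_t off = blk * b;  // first row/column of this block
        for (std::size_t i = 0; i < M.rows; ++i)
            for (std::size_t k = 0; k < b; ++k) {
                const double m = M.at(i, off + k);
                for (std::size_t j = 0; j < b; ++j)
                    C.at(i, off + j) += m * N.at(off + k, off + j);
            }
    }
    return C;
}

// FLOP counts (one multiply plus one add per inner-loop step) for n x n inputs.
double flops_full(double n) { return 2.0 * n * n * n; }

double flops_block_diagonal(double n, double blocks) {
    const double b = n / blocks;  // size of each diagonal block
    return blocks * 2.0 * n * b * b;
}
```

For n = 4503 and three blocks, `flops_block_diagonal` comes out to one third of `flops_full`, which is why I expected the structured version to be faster rather than slower.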

Your comments are welcome.