Greetings,
I've written a simple C program that multiplies two square matrices via cuBLAS. I time these operations (including allocating device memory and transferring data from host to device and back) using the C clock() function, and here is what I found. I assumed roughly 2N^3 floating-point operations for an N×N matrix multiply, so Gflop/s = 2N^3 / (seconds × 10^9).
N = 400 → 13.3 Gflop/s (dgemm), 30.5 Gflop/s (sgemm)
N = 800 → 26.7 Gflop/s (dgemm), 62.8 Gflop/s (sgemm)
N = 1600 → 55.4 Gflop/s (dgemm), 138.8 Gflop/s (sgemm)
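For reference, the timed region looks roughly like this (a simplified sketch of my harness using the legacy cuBLAS API, not the exact code; error checking omitted, and the matrix contents here are just placeholder values):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cublas.h>   /* legacy cuBLAS API */

int main(void)
{
    int N = 1600;
    size_t n2 = (size_t)N * N;
    float *A = (float *)malloc(n2 * sizeof(float));
    float *B = (float *)malloc(n2 * sizeof(float));
    float *C = (float *)malloc(n2 * sizeof(float));
    for (size_t i = 0; i < n2; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    clock_t start = clock();  /* timed region includes alloc + transfers */

    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc((int)n2, sizeof(float), (void **)&dA);
    cublasAlloc((int)n2, sizeof(float), (void **)&dB);
    cublasAlloc((int)n2, sizeof(float), (void **)&dC);
    cublasSetMatrix(N, N, sizeof(float), A, N, dA, N);
    cublasSetMatrix(N, N, sizeof(float), B, N, dB, N);

    /* C = 1.0*A*B + 0.0*C (column-major, no transposes) */
    cublasSgemm('N', 'N', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);

    /* copying the result back forces the gemm to finish before we stop the clock */
    cublasGetMatrix(N, N, sizeof(float), dC, N, C, N);

    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("N = %d: %.1f Gflop/s\n", N, 2.0 * N * N * N / (secs * 1e9));

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(A); free(B); free(C);
    return 0;
}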
My questions are these.
- Given that I am using a Mac OS X laptop with an NVIDIA GeForce 9400M card, are these numbers reasonable?
- Roughly how much of a performance improvement could I get with a better card?
- How can I roughly estimate the maximum size of the matrix that I can multiply (before I presumably run out of memory) given my card's specs? (My back-of-envelope attempt is sketched after this list.)
- On my current platform, what can I do to improve upon these results?
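Regarding the third question, here is my rough attempt, assuming dgemm only needs A, B, and C resident on the device at once (I'm not sure whether cuBLAS needs extra workspace on top of that, so this may be optimistic):

#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);

    /* dgemm needs A, B, and C resident: 3 * N^2 * sizeof(double) bytes,
       so N_max ~ sqrt(free / 24). For sgemm, sizeof(float) gives sqrt(free / 12). */
    size_t n_max = (size_t)sqrt((double)free_b / (3.0 * sizeof(double)));
    printf("free %zu MB of %zu MB -> rough dgemm limit: N ~ %zu\n",
           free_b >> 20, total_b >> 20, n_max);
    return 0;
}

Is that the right way to think about it, or does the 9400M's shared memory architecture change the calculation?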
Thanks