Reasonable timing with Cublas dgemm and sgemm


I’ve written a simple c code that multiplies two square matrices via cublas. I check the time that it takes to run these operations (including allocating memory, transferring data from host to device, and vice versa) using the C clock() function and here is what I found. I assumed that there are roughly 2N^3 floating point operations for a given NN matrix.

N = 400 -> 13.3 Gflop / sec (dgemm), 30.5 Gflop / sec (sgemm)
N = 800 -> 26.7 Gflop / sec (dgemm), 62.8 Gflop / sec (sgemm)
N = 1600 -> 55.4 Gflop / sec (dgemm), 138.8 Gflop / sec (sgemm)

My questions are these.

  1. Given that I am using a Mac OS X laptop with NVIDIA GeForce 9400M card, are these numbers reasonable?
  2. Roughly how much more performance improvement can I obtain using a better card?
  3. How can I roughly estimate the maximum size of the matrix that I can multiply (before I presumably run out of memory) if I know the specs on my card?
  4. On my current platform, what can I do to improve upon these results?


The 9400M is a compute 1.1 capable card which can’t do double precision, so on that basis alone I would have to say no. Further to that, my recollection is that a 9500GT with DDR3 memory (so both a bit more memory bandwidth and double the number of MP as your mobile part) could hit about 40GFlop/s with CUBLAS sgemm(). I did bench a couple of ION boxes at some point, but I am struggling to remember the numbers. Certainly closer to 25Gflops than 100Gflops.

I would be going back and checking your timing code, because something looks a bit wrong.

You are right. I made an error by putting one of the clock() functions at the wrong place.

Also, I checked for dgemm and realized that I was getting garbage values.

New results.

N = 100 1.47Gflop/sec

N = 200 3.45Gflop/sec

N = 400 5.04Gflop/sec

N = 800 7.68Gflop/sec

N = 1600 13.39Glop/sec (all sgemm)

Any technique I can use to improve upon these results on my machine that would also translate reasonably well to better GPUs?

Not really, no. If you want to see numbers for other cards for that same basic code, this thread might be of some interest.

It’s not clear to me what matrix size people are talking about when they list their Gflop/sec. Are people talking about the max Gflop/sec?

I imagine so. For your reference, a stock GTX275 gives a little bit less the 400GFlops in SGEMM() for a 1024x1024 matrix, about 240GFlops if you include the PCI-e bus copy times to and from standard pageable memory. This is from a little benchmark I wrote a long time ago, built with CUBLAS 2.3 (the host BLAS is Goto BLAS 2 running with 2 threads on a 3GHz Phenom II):

For N=1024

Optimized host SGEMM() solution time = 0.060839

Optimized host SGEMM() GFLOPS = 35.297711

CUBLAS SGEMM() solution time = 0.005428, transfer time to host = 0.003576

CUBLAS Gflops = 395.642592

CUBLAS Gflops (including host transfer time) = 238.506534

One thing that I noticed is that upon looking at the times, most of the CPU time is spent on copying data from the device to the host.

N = 1600, solution time = 0.598235 seconds, time spent on copying from device to host = 0.5397 seconds (~90%).

  1. is it reasonable that there that so much time is spent on this portion of the code rather than the actual sgemm()?

  2. how come so much more time is spent copying C from device to host rather than copying A and B from host to device?


Without seeing any code, I will guess that your timing is still wrong. CUBLAS calls are asynchronous - the function returns immediately rather than when the GPU operation is completed. I am guessing your device->host copy also includes the SGEMM execution time. Call cudaThreadSynchronize() before you stop your SGEMM call timer and you will get a more realistic indication of the timing.

You are right again. Btw, I remember a figure that you posted in another thread comparing performances (Gflop/sec) between GPU matrix-mult and CPU (acml) matrix-mult. Here it is.…st&id=19358

Do you know why the Gflop/sec for acml goes up after around N = 1000? When I look at my matrix-mult (albeit I’m using sgemm and not dgemm), the numbers looks as follows.

N = 800, 9.30Gflop/sec

N = 1600, 9.41Gflop/sec

N = 3200, 9.59Gflop/sec

N = 6400, 9.68Gflop/sec

Those runs were made with 4 CPU threads. I think the DGEMM() implementation in ACML uses several different kernels, depending on matrix size (actually CUBLAS does the same). The big step up in performance for N>1000 seems to be when the multithreaded “big” version starts being used.

This is a side topic but I noticed that a call to DGEMM() using Intel MKL provides a speedup upon increasing the number of processor (via openmp omp_num_thread) from 1 to 8. In a lot of cases, the speedup even hits around 7+. Does the DGEMM() recognizes parallelization even when it isn’t explicitly instructed? The speedup compares well with breaking up the matrix using openmp for loop.

No idea, I don’t use MKL. Asking Intel is probably the only way to find out for sure.

I have compared times for sgemm and dgemm using quadro plex 2200 s4. A few things to consider.

  • I have not included cublasInit() in the timing.

  • I have assigned random numbers to my input matrices (A and but have not included this process in the timing.

[codebox] /* performs operation using plain C code (double) */

69 double start = clock();


71 /* Allocate device memory for the matrices */

72 cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);

73 cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);

74 cublasAlloc(n2, sizeof(d_C[0]), (void**)&d_C);


76 cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);

77 cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);


79 cublasDgemm(‘n’, ‘n’, M, M, N, alpha, d_A, M, d_B, N, beta, d_C, M);

80 cudaThreadSynchronize();


82 cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);


84 /* memory cleanup */

85 free (h_A);

86 free (h_B);

87 free (h_C);

88 cublasFree(d_A);

89 cublasFree(d_B);

90 cublasFree(d_C);


92 double end = clock();

93 printf(“total time = %lf\n\n”, (end - start) / CLOCKS_PER_SEC);[/codebox]

Here are the numbers that I get for sgemm and dgemm

N = 1600 117Gflop/sec, 43Gflop/sec

N = 3200 204.8Gflop/sec, 56.98Gflop/sec

N = 6400 274.50Gflop/sec, 65.61Gflop/sec

N = 12800 315.83Gflop/sec, 69.07Gflop/sec

a) Are these numbers reasonable?

If so, can I do anything to improve upon this - specifically the double precision numbers (given that I have 4 total GPUs to work with and am only using one of them)?

c) How much improvement can we see from the new Fermi GPUs?


They certainly look more reasonable than anything else you have posted thus far. The SGEMM numbers look a little lower that I get, but I have no experience with that quadro plex box, nor do I know what sort of machine you are hosting it off. You most certainly can improve the numbers, given you have 4 GPUs at your disposal. I will leave the how up to your imagination.

Given that nobody actually has a Fermi based gpu yet, and there is no credible benchmark data (or even firm specifications for the GPU silicon) in the wild yet, I don’t think anybody outside of NVIDIA can answer that question with any authority. Certainly not I.

The simplest solution I can think of is to copy the input matrices A and B from the host to the 4 GPUs and let the each GPU do work on 1/4 of the output matrix. Finally, we can read back the result back to the host.

Would there be potentially problems here and any other implementation that might be faster than this?

That would be a reasonable approach, yes.

The Implementation details aren’t that straightforward. The host code must be multithreaded, with each host threading holding a CUDA context. There are ways to potentially improve on that model, basically ways to either eliminate or hide as much of the fixed overheads and latency. There isn’t anything to say the the host CPU can’t be doing something while the GPUs are busy - either some of the matrix multiply itself, or uploading data to the GPU while it is still calculating (although that requires some features that the CUBLAS API itself doesn’t expose, although it exists in CUDA) and taking over memory management of the GPU yourself, which can be reduced to a zero overhead operation.

Producing the optimal scheme is a surprisingly detailed process that requires a lot of thought. It is something not a lot of people seem to have done or if they have, they haven’t published anything about it, with the exception of M.Fatica from NVIDIA, who wrote a very good paper about CUDA accelerating the Linpack benchmark. I guess you can put me in that category as well, although that might change quite soon.

At this point, there is much not a lot more to add to this discussion. Best of luck.