Timing cublasSgemm RDTSC


Could someone please suggest the most accurate way to time cublasSgemm?
The machine is an AMD Opteron, so the RDTSC instruction cannot be used.

(The AMD Technical Bulletin says that it provides power management mechanisms that independently adjust the performance state ("P-state") and power state ("C-state") of the processor; these state changes can affect the rate at which a processor core's Time Stamp Counter (TSC) is incremented. Applications should avoid using the TSC directly (through the RDTSC instruction) for timekeeping and instead rely on the appropriate operating system calls.)

When I use the gettimeofday function, a 1000*1000 matrix multiplication takes about 0.001046 seconds,
or 957 Gflops/sec. Unfortunately, one can't rely on the accuracy of the gettimeofday function.


It's impossible that you reached 957 Gflops!

for m = n = k = 1000, it’s equal to 190.9 Gflops…

I think you can use the cutil.h timer functions:

unsigned int timer;
double time;

cutCreateTimer(&timer);
cutStartTimer(timer);

// ... the code you want to time ...

cutStopTimer(timer);
time = cutGetTimerValue(timer);
// time is in milliseconds, so:
double gflops = ((m * n * (2. * k - 1.)) / time) / 1e6;

Thanks a lot Samuel for the detailed reply.

Have you measured the accuracy of these functions (cutCreateTimer, cutStartTimer)? Do they use any hardware counters (the assembly RDTSC instruction) in the background? If so, then on a dual-core AMD the counter value might be affected by threads switching between the cores, giving a wrong TSC value.

Thanks once again for your help.

gettimeofday should be fine for timing. Just be sure to average over a few milliseconds' worth of calls between two gettimeofday calls, so that the precision of the timer isn't affecting your results.

Calls to CUDA kernels are asynchronous, too (meaning that they return right away). For accurate timing, you need to precede any timer call with cudaThreadSynchronize().


Since I couldn't use RDTSC because of the wrong TSC count value, I have used CUDA's timing functions.
It's a Tesla S870 on an AMD Opteron.

1000*1000 matrix multiplication using cublasSgemm

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
cudaThreadSynchronize(); // keeping this as optional
float cublasTime = cutGetTimerValue(timer); // cublasTime in milliseconds
float flops = (2.0f * N * N * N) * 1000 / cublasTime;
printf("CUBLAS GFLOPS: %g\n", flops / pow(10.0f, 9));

My output

Without ThreadSynchronize = 0.198 ms OR 10101 GFLOPS

With ThreadSynchronize = 22.414 ms OR 89.2299 GFLOPS

The output without cudaThreadSynchronize is way off. But with cudaThreadSynchronize it's 89.2299 GFLOPS, which is far less than the 518.4 GFLOPS peak.

Any comments / experiences regarding performance would really be appreciated.

In marketing materials, NVIDIA includes the operations performed by the texture units in the peak GFLOPS. As your particular application does not make use of this hardware, you can only expect to see 345 GFLOPS. This is still well above your observed performance, of course.

I’m pretty sure that CUBLAS 2.0 includes a more optimized Sgemm, provided by someone in the forums, that hits 200 GFLOPS.