cuBLAS and TFLOPS measurement: is it possible at all to measure TFLOPS to any reasonable degree of accuracy?

The published specifications for devices are generally peak theoretical numbers, and they are not achievable in practice. With well-designed tests (e.g. CUBLAS) you may get in the range of 70%-90% of these numbers. You can find examples with a bit of searching. Here is one I worked on recently, albeit fp16.

If you want to provide an example of the work that you did with cublas, along with your measurements, how you measured, and the device you are running on, it would be possible to take a closer look.

Here’s an example I just whipped up, following cublasDgemm documentation:

$ cat t10.cu
#include <cublas_v2.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>

#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int ds = 4096;
int main(){

  cublasHandle_t h;
  cublasStatus_t stat = cublasCreate(&h);
  cublasOperation_t transa = CUBLAS_OP_N;
  cublasOperation_t transb = CUBLAS_OP_N;
  int m = ds;
  int n = ds;
  int k = ds;
  const double alpha = 1.0;
  double          *A, *B, *C;
  int lda = ds;
  int ldb = ds;
  const double beta = 0.0;
  int ldc = ds;
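  // size in bytes of one ds x ds matrix; size_t avoids int overflow for larger ds
  const size_t dsb = (size_t)ds*ds*sizeof(A[0]);
  // the buffers are left uninitialized; only the timing is of interest here
  cudaMalloc(&A, dsb);
  cudaMalloc(&B, dsb);
  cudaMalloc(&C, dsb);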

  // warm-up
  stat = cublasDgemm(h,
                           transa, transb,
                           m, n, k,
                           &alpha,
                           A, lda,
                           B, ldb,
                           &beta,
                           C, ldc);
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  stat = cublasDgemm(h,
                           transa, transb,
                           m, n, k,
                           &alpha,
                           A, lda,
                           B, ldb,
                           &beta,
                           C, ldc);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
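  unsigned long long dsl = ds;
  // a square GEMM performs 2*ds^3 floating-point operations;
  // operations per microsecond is the same as MFLOP/s
  if (stat == CUBLAS_STATUS_SUCCESS)
    std::cout << dsl*dsl*dsl*2/(float)dt << "MF/s" << std::endl;
  cudaFree(A); cudaFree(B); cudaFree(C);
  cublasDestroy(h);
}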


$ nvcc -o t10 t10.cu -lcublas
$ ./t10
1.63598e+07MF/s
$ ./t10
1.65689e+07MF/s
$ ./t10
1.49358e+07MF/s
$

This was run with CUDA 12.0 on an A100 (SXM4) GPU. The three runs above work out to roughly 14.9 to 16.6 TFLOP/s, so we are hitting about 75% or better of the stated peak number of ~20TF.
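
For anyone who wants to redo the arithmetic on their own numbers, here is a minimal host-only sketch of the conversion from an elapsed time to TFLOP/s and a percentage of peak. The function names (gemm_flops, tflops, percent_of_peak) are made up for illustration and are not part of cuBLAS, and the peak value you pass in is whatever figure you take from the spec sheet for your device and precision.

```cpp
#include <cassert>

// Floating-point operations for one GEMM of an (m x k) by (k x n) product:
// each of the m*n outputs needs k multiplies and k adds.
double gemm_flops(unsigned long long m, unsigned long long n, unsigned long long k) {
  return 2.0 * (double)m * (double)n * (double)k;
}

// Operations per microsecond equals MFLOP/s; dividing by 1e6 gives TFLOP/s.
double tflops(double flops, double elapsed_usec) {
  assert(elapsed_usec > 0.0);
  return flops / elapsed_usec / 1.0e6;
}

// Achieved throughput as a percentage of an assumed peak figure.
double percent_of_peak(double achieved_tflops, double peak_tflops) {
  assert(peak_tflops > 0.0);
  return 100.0 * achieved_tflops / peak_tflops;
}
```

As a usage example, the first run above printed 1.63598e+07 MF/s, which is about 16.4 TFLOP/s. Calling percent_of_peak(16.36, 19.5) gives roughly 84%, assuming 19.5 TFLOP/s as the A100 FP64 tensor core peak (the "~20TF" mentioned above).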