cuBLAS and TFLOPS measurement: is it possible at all to measure TFLOPS to any reasonable degree of accuracy?

I have been scouring the available documentation, but it seems scarce or non-existent at best, on measuring the advertised TFLOPS per GPU model, i.e. FP32, FP64, etc.

I started off with ROCm MI GPUs, where rocBLAS provides a utility called rocblas-bench once it is built, but the numbers it reports are far off from the advertised figures. I then switched to an NVIDIA RTX card and in turn cuBLAS, the equivalent of rocBLAS, but there is not much available in the way of tools or documentation for measuring TFLOPS to any degree of accuracy close to the advertised numbers.

I have seen some NVIDIA blogs that do comparisons with code snippets built on the cuBLAS API, but those are all relative comparisons between different GPUs, with nothing about absolute numbers.

The published specifications for devices are generally peak theoretical numbers; they are not achievable in practice. With well-designed tests (e.g. using cuBLAS) you may get in the range of 70%-90% of these numbers. You can find examples with a bit of searching; here is one I worked on recently, albeit FP16.
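
For context, the peak number itself is just arithmetic on the device's execution units and clock. Here is a minimal sketch of that calculation; the FP64 FMA unit count per SM is an architecture-specific assumption that the runtime does not report directly:

#include <cuda_runtime.h>
#include <iostream>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  // assumption: FP64 FMA units per SM for your architecture -- 32 on
  // A100 (whose FP64 tensor cores double the effective DGEMM rate),
  // far fewer on consumer GeForce parts
  const int fp64UnitsPerSM = 32;
  // peak flops/s = 2 flops per FMA * units/SM * SM count * clock (kHz -> Hz)
  double peak = 2.0 * fp64UnitsPerSM * prop.multiProcessorCount
                    * prop.clockRate * 1000.0;
  std::cout << prop.name << ": ~" << peak/1e12
            << " TF/s FP64 (non-tensor-core)" << std::endl;
}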

If you want to provide an example of the work you did with cuBLAS, along with your measurements, how you measured, and the device you are running on, it would be possible to take a closer look.

Here’s an example I just whipped up, following the cublasDgemm documentation:

$ cat t10.cu
#include <cublas_v2.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>

#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int ds = 4096;
int main(){

  cublasHandle_t h;
  cublasStatus_t stat = cublasCreate(&h);
  cublasOperation_t transa = CUBLAS_OP_N;
  cublasOperation_t transb = CUBLAS_OP_N;
  int m = ds;
  int n = ds;
  int k = ds;
  const double alpha = 1.0;
  double          *A, *B, *C;
  int lda = ds;
  int ldb = ds;
  const double beta = 0.0;
  int ldc = ds;
  const int dsb = ds*ds*sizeof(A[0]);
  cudaMalloc(&A, dsb);
  cudaMalloc(&B, dsb);
  cudaMalloc(&C, dsb);

  // warm-up
  stat = cublasDgemm(h,
                           transa, transb,
                           m, n, k,
                           &alpha,
                           A, lda,
                           B, ldb,
                           &beta,
                           C, ldc);
  cudaDeviceSynchronize();
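  // time a single GEMM with a host-based timer, now that the warm-up
  // call has absorbed one-time initialization and kernel-load overhead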
  unsigned long long dt = dtime_usec(0);
  stat = cublasDgemm(h,
                           transa, transb,
                           m, n, k,
                           &alpha,
                           A, lda,
                           B, ldb,
                           &beta,
                           C, ldc);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  unsigned long long dsl = ds;
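  // a GEMM performs 2*m*n*k flops (m*n*k multiply-adds); dt is in
  // microseconds, so flops/us prints as MF/s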
  if (stat == CUBLAS_STATUS_SUCCESS)
    std::cout << dsl*dsl*dsl*2/(float)dt << "MF/s" << std::endl;
}


$ nvcc -o t10 t10.cu -lcublas
$ ./t10
1.63598e+07MF/s
$ ./t10
1.65689e+07MF/s
$ ./t10
1.49358e+07MF/s
$

This was run with CUDA 12.0 on an A100 (SXM4) GPU. ~1.64e+07 MF/s works out to ~16.4 TF, so we are hitting about 75% or better (roughly 82%) of the stated peak number of ~20 TF.
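
If you prefer device-side timing, the host timer around the second GEMM can be swapped for CUDA events. A minimal sketch of just that region, assuming the same handle and buffers as in t10.cu above:

  // time the post-warm-up GEMM with CUDA events; elapsed time is in ms
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  stat = cublasDgemm(h, transa, transb, m, n, k,
                     &alpha, A, lda, B, ldb, &beta, C, ldc);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms;
  cudaEventElapsedTime(&ms, start, stop);
  if (stat == CUBLAS_STATUS_SUCCESS)
    // 2*ds^3 flops / (ms * 1e-3 s) / 1e12 = TF/s
    std::cout << 2.0*ds*ds*ds/(ms*1e9) << " TF/s" << std::endl;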

Thanks, I will try that. I don't have an A100, but an RTX 2070; that does not matter, though, since each model has its own rated peak numbers. I do realize an exact GPU TFLOPS measurement cannot be achieved, since many factors come into play at every level, hence my title says "reasonable degree of accuracy". In contrast, a 10G network's line rate can easily be validated at 9.9 Gb/s with a readily available open-source tool, i.e. iperf.
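
One caveat for a GeForce part like the RTX 2070: its headline TFLOPS figure is FP32, and FP64 runs at only 1/32 of the FP32 rate on Turing, so the cublasDgemm test above will land far below the advertised number. A minimal FP32 sketch along the same lines as t10.cu, using cublasSgemm and compiled the same way (nvcc ... -lcublas):

// FP32 variant of t10.cu: same warm-up-then-time pattern, but with
// cublasSgemm, since the headline GeForce TFLOPS number is FP32
#include <cublas_v2.h>
#include <sys/time.h>
#include <iostream>

#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int ds = 4096;
int main(){
  cublasHandle_t h;
  cublasCreate(&h);
  const float alpha = 1.0f;
  const float beta = 0.0f;
  float *A, *B, *C;
  const size_t dsb = (size_t)ds*ds*sizeof(float);
  cudaMalloc(&A, dsb);
  cudaMalloc(&B, dsb);
  cudaMalloc(&C, dsb);
  cublasStatus_t stat = CUBLAS_STATUS_SUCCESS;
  for (int i = 0; i < 2; i++){  // pass 0 is the warm-up, pass 1 is timed
    cudaDeviceSynchronize();
    unsigned long long dt = dtime_usec(0);
    stat = cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ds, ds, ds,
                       &alpha, A, ds, B, ds, &beta, C, ds);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    if (i && stat == CUBLAS_STATUS_SUCCESS)
      std::cout << 2.0*ds*ds*ds/dt << "MF/s" << std::endl;  // flops/us = MF/s
  }
}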