Cublas and tflop measure, is it possible at all to measure tflop to any reasonable degree of accuracy?

g900nvda · May 13, 2023, 6:26pm

I have been scouring the available documentation, but it seems scarce at best and non-existent at worst. I am trying to measure the TFLOPS advertised per GPU model, e.g. FP32, FP64, etc.

I started off with ROCm MI GPUs. rocBLAS includes a utility called rocblas-bench, available once the library is built, but its numbers are far off. I then switched to an NVIDIA RTX card and cuBLAS, the equivalent of rocBLAS, but there is not much available in the way of tools or documentation for measuring TFLOPS to any degree of accuracy close to the advertised figures.

I see some NVIDIA blog posts with comparisons based on code snippets that use the cuBLAS API, but those are all relative comparisons between different GPUs and say nothing about absolute numbers.

Robert_Crovella · May 13, 2023, 6:39pm

The published specifications for devices are generally peak theoretical numbers. They are not achievable in practice. In practice you may get in the range of 70%-90% of these numbers with well-designed tests (e.g. CUBLAS). You can find examples with a bit of searching. Here is one I worked on recently, albeit fp16.

If you want to provide an example of the work that you did with cublas, along with your measurements, how you measured, and the device you are running on, it would be possible to take a closer look.

Here’s an example I just whipped up, following cublasDgemm documentation:

$ cat t10.cu
#include <cublas_v2.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>

#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int ds = 4096;
int main(){

  cublasHandle_t h;
  cublasStatus_t stat = cublasCreate(&h);
  cublasOperation_t transa = CUBLAS_OP_N;
  cublasOperation_t transb = CUBLAS_OP_N;
  int m = ds;
  int n = ds;
  int k = ds;
  const double alpha = 1.0;
  double          *A, *B, *C;
  int lda = ds;
  int ldb = ds;
  const double beta = 0.0;
  int ldc = ds;
  // size in bytes of one ds x ds matrix; size_t avoids int overflow if ds is increased
  const size_t dsb = (size_t)ds*ds*sizeof(A[0]);
  cudaMalloc(&A, dsb);
  cudaMalloc(&B, dsb);
  cudaMalloc(&C, dsb);

  // warm-up
  stat = cublasDgemm(h,
                           transa, transb,
                           m, n, k,
                           &alpha,
                           A, lda,
                           B, ldb,
                           &beta,
                           C, ldc);
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  stat = cublasDgemm(h,
                           transa, transb,
                           m, n, k,
                           &alpha,
                           A, lda,
                           B, ldb,
                           &beta,
                           C, ldc);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  unsigned long long dsl = ds;
  if (stat == CUBLAS_STATUS_SUCCESS)
    std::cout << dsl*dsl*dsl*2/(float)dt << "MF/s" << std::endl;
}


$ nvcc -o t10 t10.cu -lcublas
$ ./t10
1.63598e+07MF/s
$ ./t10
1.65689e+07MF/s
$ ./t10
1.49358e+07MF/s
$
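
The three runs above differ by roughly 10%, so a single timed call is a noisy estimate. One common way to steady the number is to repeat the timed region several times and keep the fastest run. The following is only a host-side sketch of that idea: `best_time_seconds` is a hypothetical helper, not part of cuBLAS or the CUDA runtime, and it assumes the callable you pass blocks until the GPU has finished (for example, by calling cudaDeviceSynchronize() after the cublasDgemm call).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` once untimed as a warm-up, then `reps`
// timed repetitions, and returns the fastest wall-clock time in seconds.
// Assumption: `work` does not return until the GPU work has completed.
double best_time_seconds(const std::function<void()>& work, int reps) {
  assert(reps >= 1);
  using clock = std::chrono::steady_clock;

  work();  // warm-up run, not timed

  std::vector<double> samples;
  samples.reserve(reps);
  for (int i = 0; i < reps; ++i) {
    auto t0 = clock::now();
    work();
    auto t1 = clock::now();
    samples.push_back(std::chrono::duration<double>(t1 - t0).count());
  }
  // The fastest repetition is the one least disturbed by other activity.
  return *std::min_element(samples.begin(), samples.end());
}
```

In t10.cu the callable would wrap the cublasDgemm call followed by cudaDeviceSynchronize(). Reporting the minimum is a design choice; the median is a reasonable alternative if you care about typical rather than best-case throughput.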

This was run with CUDA 12.0 on an A100 (SXM4) GPU. The results above, about 1.49e7 to 1.66e7 MF/s, correspond to roughly 14.9 to 16.6 TF/s, so we are hitting about 75% or better of the stated peak number of ~20 TF.
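
The arithmetic behind that comparison can be sketched as follows. It uses the same convention as the listing above (2*m*n*k floating-point operations per GEMM); the function names are made up for illustration, and the peak value is whatever figure you take from the spec sheet of your own device.

```cpp
#include <cassert>
#include <cstdint>

// Conventional FLOP count for a GEMM of shape (m x k) * (k x n):
// one multiply and one add per inner-product term, so 2*m*n*k.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
  return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
         static_cast<double>(k);
}

// Achieved throughput in TFLOP/s for `flops` operations done in `seconds`.
double tflops(double flops, double seconds) {
  assert(seconds > 0.0);
  return flops / seconds / 1e12;
}

// Achieved throughput as a percentage of the advertised peak.
double percent_of_peak(double achieved_tflops, double peak_tflops) {
  assert(peak_tflops > 0.0);
  return 100.0 * achieved_tflops / peak_tflops;
}
```

For the 4096 x 4096 x 4096 case, gemm_flops gives about 1.37e11 operations. The first run above (1.63598e+07 MF/s) implies an elapsed time of about 8.4 ms, which works out to about 16.4 TF/s, or about 82% of a 20 TF peak.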

g900nvda · May 14, 2023, 9:46pm

Thanks, I will try that. I don't have an A100, but I do have an RTX 2070; that should not matter, since each model has its own rated peak numbers. I do realize the rated peak cannot be reached exactly, because many factors come into play at each level, hence "reasonable degree of accuracy" in my title. In contrast, a 10G network line rate can easily be validated at 9.9 Gbit/s with a readily available open-source tool such as iperf.
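
For reference, rated peak numbers are usually derived as cores x clock x operations per core per cycle, with a fused multiply-add counted as 2 operations. The sketch below shows that calculation only; `peak_tflops` is a hypothetical helper, and the core count and clock you feed it are assumptions to be checked against the spec sheet for your exact card.

```cpp
#include <cassert>

// Hypothetical helper: theoretical peak in TFLOP/s.
//   cores                    - number of arithmetic units for the precision in question
//   clock_ghz                - clock in GHz (spec-sheet boost clock is an assumption;
//                              the clock during a real run may be different)
//   flops_per_core_per_cycle - 2 for one fused multiply-add per cycle
double peak_tflops(int cores, double clock_ghz,
                   double flops_per_core_per_cycle = 2.0) {
  assert(cores > 0 && clock_ghz > 0.0);
  // cores * GHz * flops/cycle is GFLOP/s; divide by 1000 for TFLOP/s.
  return cores * clock_ghz * flops_per_core_per_cycle / 1000.0;
}
```

As an example with assumed figures for an RTX 2070 (2304 FP32 cores at a 1.62 GHz boost clock), peak_tflops(2304, 1.62) gives about 7.46 TF for FP32. Note that the listing above uses cublasDgemm, which is FP64; on GeForce cards the FP64 rate is typically a small fraction of the FP32 rate, so for a comparison against the FP32 figure the test would need to use cublasSgemm with float data.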
