Why does the Tesla T4 peak performance test result mismatch the official doc?

I ran a Tesla T4 peak performance test using the cuBLAS 10.0 library in an Ubuntu 14.04 / CUDA 10.0 / CUDA driver 410.79 environment. The test results are as follows.

TYPE   program            size         actual       theoretical
FP32   cublasSgemm(NT)    8192x8192    5.4 TFLOPS   8.1 TFLOPS
INT8   cublasGemmEx(NT)   16384x16384  15.2 TOPS    130 TOPS
FP16   cublasGemmEx(NT)   16384x16384  17 TFLOPS    65 TFLOPS

The measured T4 peak performance is much worse than the theoretical values given in the official 'Turing Architecture Whitepaper'. I don't know why this happens. My guess is that Ubuntu 14.04 may not support CUDA 10.0 very well.
Does anyone know about this problem and can help me out? Thanks very much.

Can you post your test program, so we know what you are running and how you are measuring the performance?

I took a look at the Turing Architecture Whitepaper (https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf) and did not see performance numbers for the Tesla T4 in it. Can you provide the location (URL) of the whitepaper you are looking at, and indicate the page where the performance numbers are listed?

Re SGEMM, note that theoretical FLOPS numbers are generally not achievable with matrix multiplications, on any platform. For large matrices, compiled codes might achieve 75% of theoretical throughput, hand-optimized code might achieve 90%.

My guess would be that you are seeing low numbers across the board because you have not set application clocks correctly using nvidia-smi.

Try this command before running your test:

nvidia-smi -ac 5001,1590

Here’s an example:

$ cat t1412.cu
#include <cublas_v2.h>
#include <iostream>
#include <stdlib.h>
#include <assert.h>
#include <cuda_fp16.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const size_t default_ds = 8192;
typedef __half ft;
int main(int argc, char *argv[]){

  size_t ds = default_ds;
  if (argc > 1) ds = atoi(argv[1]);
  ft *d_A, *d_B, *d_C, *h_C = new ft[ds*ds];
  cublasHandle_t  h;
  cublasStatus_t err = cublasCreate(&h);
  assert(err == CUBLAS_STATUS_SUCCESS);
  err = cublasSetMathMode(h, CUBLAS_TENSOR_OP_MATH);
  assert(err == CUBLAS_STATUS_SUCCESS);
  cudaError_t cerr = cudaMalloc(&d_A, ds*ds*sizeof(ft));
  assert(cerr == cudaSuccess);
  cerr = cudaMalloc(&d_B, ds*ds*sizeof(ft));
  assert(cerr == cudaSuccess);
  cerr = cudaMalloc(&d_C, ds*ds*sizeof(ft));
  assert(cerr == cudaSuccess);
  ft a = __float2half(1.0f);
  ft b = __float2half(0.0f);
  err = cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ds, ds, ds, &a, d_A, ds, d_B, ds, &b, d_C, ds); // warm-up run
  assert(err == CUBLAS_STATUS_SUCCESS);
  cerr = cudaDeviceSynchronize();
  assert(cerr == cudaSuccess);
  unsigned long long dt = dtime_usec(0);
  err = cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ds, ds, ds, &a, d_A, ds, d_B, ds, &b, d_C, ds);
  cerr = cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  assert(err == CUBLAS_STATUS_SUCCESS);
  assert(cerr == cudaSuccess);
  std::cout << "elapsed: " << dt/(float)USECPSEC << " GF: " << (USECPSEC*(2.0*ds*ds*ds))/(((float)dt)*(1024ULL*1048576ULL)) << std::endl;
  return 0;
}
$ nvcc -o t1412 t1412.cu -lcublas
$ nvidia-smi
Tue Jan 15 13:50:12 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.76       Driver Version: 410.76       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:02:00.0 Off |                    0 |
| N/A   33C    P8    15W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:82:00.0 Off |                    0 |
| N/A   32C    P8    15W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi -rac
All done.
$ ./t1412 8192
elapsed: 0.055894 GF: 18320.4
$ nvidia-smi -ac 5001,1590
Applications clocks set to "(MEM 5001, SM 1590)" for GPU 00000000:02:00.0
Applications clocks set to "(MEM 5001, SM 1590)" for GPU 00000000:82:00.0
All done.
$ ./t1412 8192
elapsed: 0.023378 GF: 43801.9
$

Thanks for reply.

I am pretty sure the application clocks were set successfully, following https://devblogs.nvidia.com/increase-performance-gpu-boost-k80-autoboost/:
Firstly, set persistence mode.
Secondly, set the application clocks.
Finally, query the application clocks. (5001, 1590)

And with your code, the program got 21 TFLOPS at the default clocks (5001, 585), and up to 43 TFLOPS after setting the clocks to (5001, 1590).

Meanwhile, I also modified the CUDA sample matrixMulCUBLAS to test cublasHgemm, but only got 24 TFLOPS.

Comparing the two programs, the only difference I found is the initial values of d_A and d_B.

So I added some initialization code to your program. When I initialize A and B with identical values (1.0), the program gets a high 36 TFLOPS, but when I initialize them with random values, it only reaches about 22 TFLOPS.

#include <cublas_v2.h>
#include <iostream>
#include <stdlib.h>
#include <assert.h>
#include <cuda_fp16.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
#define NITER 300

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const size_t default_ds = 8192;
typedef __half ft;
int main(int argc, char *argv[]){

  size_t ds = default_ds;
  if (argc > 1) ds = atoi(argv[1]);
  ft *d_A, *d_B, *d_C;
  ft *h_A = new ft[ds*ds];
  ft *h_B = new ft[ds*ds];
  ft *h_C = new ft[ds*ds];
  for (int i = 0; i < ds*ds; i++)
  {
    //h_A[i] = __float2half(1.0);
    h_A[i] = __float2half(rand() / (float)RAND_MAX);
    //h_B[i] = __float2half(1.0);
    h_B[i] = __float2half(rand() / (float)RAND_MAX);
  }

  cublasHandle_t  h;
  cublasStatus_t err = cublasCreate(&h);
  assert(err == CUBLAS_STATUS_SUCCESS);
  err = cublasSetMathMode(h, CUBLAS_TENSOR_OP_MATH);
  assert(err == CUBLAS_STATUS_SUCCESS);
  cudaError_t cerr = cudaMalloc(&d_A, ds*ds*sizeof(ft));
  assert(cerr == cudaSuccess);
  cerr = cudaMalloc(&d_B, ds*ds*sizeof(ft));
  assert(cerr == cudaSuccess);
  cerr = cudaMalloc(&d_C, ds*ds*sizeof(ft));
  assert(cerr == cudaSuccess);
  ft a = __float2half(1.0f);
  ft b = __float2half(0.0f);

  cudaMemcpy(d_A, h_A, ds*ds*sizeof(ft), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, ds*ds*sizeof(ft), cudaMemcpyHostToDevice);

  err = cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ds, ds, ds, &a, d_A, ds, d_B, ds, &b, d_C, ds); // warm-up run
  assert(err == CUBLAS_STATUS_SUCCESS);
  cerr = cudaDeviceSynchronize();
  assert(cerr == cudaSuccess);

  unsigned long long dt = dtime_usec(0);

  for (int i = 0; i < NITER; i++)
  {
    err = cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ds, ds, ds, &a, d_A, ds, d_B, ds, &b, d_C, ds);
  }
  cerr = cudaDeviceSynchronize();

  dt = dtime_usec(dt);
  float dt_us = (float)dt / NITER;
  assert(err == CUBLAS_STATUS_SUCCESS);
  assert(cerr == cudaSuccess);
  std::cout << "elapsed: " << dt_us/(float)USECPSEC << " GF: " << (USECPSEC*(2.0*ds*ds*ds))/(((float)dt_us)*(1024ULL*1048576ULL)) << std::endl;
  return 0;
}

The numbers are in the whitepaper WP-09183-001_v01, page 60, Table 5: Comparison of the Pascal Tesla P4 and the Turing Tesla T4.

I also tested cublasSgemm (NT, 8192) on a GTX 1080; the program achieved 8.1 TFLOPS of the 8.8 TFLOPS peak.
As you said, hand-optimized code can achieve 90%, but cublasSgemm is already optimized at the SASS level.

And I wrote a CUDA kernel to test a 4096x4096 sgemm_nt on the GTX 1080 (kernel_64x64); that kernel can achieve 80%+ of peak performance.

1. Another question about FP32 peak performance: I have tested the Tesla T4, Tesla P4, and GeForce GTX 1080, all at default clocks.

GPU        function          matrix_size  actual (TFLOPS)  peak (TFLOPS)  percentage
Tesla T4   cublasSgemm(NT)   8192         5.0              8.1            61%
Tesla P4   cublasSgemm(NT)   8192         4.67             5.5            85%
GTX 1080   cublasSgemm(NT)   8192         8.1              8.8            92%

Why is the actual FP32 performance of the Tesla T4 not as good?
And the INT8 efficiency is even worse than FP32. (15 TOPS / 130 TOPS = 11.5%)

2. I also used the CUDA bandwidthTest sample to test global memory bandwidth, and only got 237 GB/s of the 320 GB/s peak. Is that reasonable?

The bandwidth achieved looks reasonable. The general trend with GDDR seems to be that efficiency (achieved vs theoretical throughput) declines with each new generation. This is likely due to technical issues inherent in this type of memory, such as latencies not improving (much) from generation to generation. If memory serves, GPUs with GDDR5 could achieve up to 85% efficiency, with GDDR5X this dropped to around 80%, and it looks like we are now at around 75% with GDDR6 based on your data.

First-generation memory controllers for new DRAM types often suffer from various small inefficiencies that are eliminated in subsequent generations as processor designers gather practical experience with the new memory type. Since this is the first use of GDDR6 in NVIDIA GPUs, one could reasonably speculate that small improvements (e.g. 2% - 3%) to the efficiency of GPU memory access may be seen in future parts that make use of GDDR6.

I do not have personal experience with Turing GPUs, so I cannot explain the surprisingly low SGEMM efficiency you observe. Setting the application clocks to the highest supported boost clocks, as suggested by Robert Crovella, seems like the right approach, because it seems safe to assume that the peak numbers from table 5 assume use of those clocks. But you state that you are already using those clocks and performance is still low.

I guess there is the possibility that NVIDIA has not had sufficient time to fully optimize SGEMM on Turing. Writing optimized assembly code is a time-consuming endeavor. If this is important to your use case, you could file an enhancement request with NVIDIA with regard to SGEMM performance. Use the bug reporting page to file such a request, and prefix the synopsis with “RFE:” so it is readily recognized as an enhancement request.

In “NVIDIA Tensor Core Programmability, Performance & Precision”, Markidis et al. tested the half-precision performance of the Tensor Cores using cublasGemmEx (Tesla V100). They got 83 TFLOPS of 112 TFLOPS (74%). They also got better single-precision performance on the Tesla V100 (10+ TFLOPS of 14 TFLOPS).

And I dumped the volta_sgemm_128x128_nt SASS code; the instructions are not very hard to understand. Maybe the SGEMM code is not fully optimized.

Yes, I have the same problem. I guess there is the possibility that NVIDIA has not had sufficient time to fully optimize SGEMM on Turing.