I ran a Tesla T4 peak performance test using the cuBLAS 10.0 library on Ubuntu 14.04 with CUDA 10.0 and CUDA driver 410.79. The results are as follows:
TYPE   routine             size         measured     theoretical
FP32   cublasSgemm (NT)    8192x8192    5.4 TFLOPS   8.1 TFLOPS
INT8   cublasGemmEx (NT)   16384x16384  15.2 TOPS    130 TOPS
FP16   cublasGemmEx (NT)   16384x16384  17 TFLOPS    65 TFLOPS
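For reference, measured TFLOPS figures like these are normally derived from the standard GEMM operation count divided by elapsed time. A minimal sketch of that arithmetic (the function names and the timing value are illustrative, not from my actual benchmark):

```python
def gemm_flops(m, n, k):
    # C = alpha*A*B + beta*C performs roughly 2*m*n*k floating-point
    # operations (one multiply and one add per accumulated term).
    return 2 * m * n * k

def achieved_tflops(m, n, k, seconds):
    # Convert the elapsed time of one GEMM call into TFLOPS.
    return gemm_flops(m, n, k) / seconds / 1e12

# The FP32 case above: an 8192^3 GEMM is about 1.1 TFLOP of work,
# so 5.4 TFLOPS corresponds to an elapsed time of roughly 0.2 s.
print(gemm_flops(8192, 8192, 8192) / 1e12)      # ~1.0995 TFLOP
print(achieved_tflops(8192, 8192, 8192, 0.2036))  # ~5.4 TFLOPS
```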
The measured T4 peak performance is much worse than the theoretical values given in the official Turing Architecture Whitepaper, and I don’t know why. My guess is that Ubuntu 14.04 may not support CUDA 10.0 very well.
Does anyone know what causes this and could you help me? Thanks very much.
Re SGEMM, note that theoretical FLOPS numbers are generally not achievable with matrix multiplications, on any platform. For large matrices, compiled code might achieve about 75% of theoretical throughput, while hand-optimized code might reach about 90%.
The bandwidth achieved looks reasonable. The general trend with GDDR seems to be that efficiency (achieved vs theoretical throughput) declines with each new generation. This is likely due to technical issues inherent in this type of memory, such as latencies not improving (much) from generation to generation. If memory serves, GPUs with GDDR5 could achieve up to 85% efficiency, with GDDR5X this dropped to around 80%, and it looks like we are now at around 75% with GDDR6 based on your data.
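To make those efficiency percentages concrete, they are just the ratio of achieved to theoretical bandwidth. A quick sketch; the 320 GB/s peak for the T4's GDDR6 is my assumption from the public spec sheet, not a measurement:

```python
def bandwidth_efficiency(achieved_gbs, theoretical_gbs):
    # Fraction of the theoretical memory bandwidth actually delivered.
    return achieved_gbs / theoretical_gbs

# Assumed theoretical peak for the T4 (GB/s); illustrative only.
T4_GDDR6_PEAK = 320.0

# At ~75% efficiency, achievable bandwidth would be near 240 GB/s.
print(bandwidth_efficiency(240.0, T4_GDDR6_PEAK))  # 0.75
```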
First-generation memory controllers for new DRAM types often suffer from various small inefficiencies that are eliminated in subsequent generations as processor designers gather practical experience with the new memory type. Since this is the first use of GDDR6 in NVIDIA GPUs, one could reasonably speculate that small improvements (e.g. 2% - 3%) to the efficiency of GPU memory access may be seen in future parts that make use of GDDR6.
I do not have personal experience with Turing GPUs, so I cannot explain the surprisingly low SGEMM efficiency you observe. Setting the application clocks to the highest supported boost clocks, as suggested by Robert Crovella, seems like the right approach, because it seems safe to assume that the peak numbers from table 5 assume use of those clocks. But you state that you are already using those clocks and performance is still low.
I guess there is the possibility that NVIDIA has not had sufficient time to fully optimize SGEMM on Turing. Writing optimized assembly code is a time-consuming endeavor. If this is important to your use case, you could file an enhancement request with NVIDIA with regard to SGEMM performance. Use the bug reporting page to file such a request, and prefix the synopsis with “RFE:” so it is readily recognized as an enhancement request.
In “NVIDIA Tensor Core Programmability, Performance & Precision”, Markidis et al. test the half-precision performance of Tensor Cores using cublasGemmEx on a Tesla V100. They achieved 83 TFLOPS out of 112 TFLOPS (74%). They also measured better single-precision performance on the Tesla V100 (10+ TFLOPS out of 14 TFLOPS).
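The efficiency gap can be made explicit as simple ratios of the numbers quoted in this thread (a quick Python check, not a benchmark):

```python
def efficiency(measured, theoretical):
    # Fraction of theoretical peak throughput actually achieved.
    return measured / theoretical

# Numbers from this thread (TFLOPS / TOPS):
v100_fp16 = efficiency(83, 112)    # ~0.74 on V100 Tensor Cores
t4_fp32   = efficiency(5.4, 8.1)   # ~0.67 on T4
t4_fp16   = efficiency(17, 65)     # ~0.26 on T4
t4_int8   = efficiency(15.2, 130)  # ~0.12 on T4
```

The V100 FP16 and SGEMM ratios are in the expected 70-90% range, while the T4 FP16 and INT8 ratios are far below it, which is what makes these results look anomalous rather than just normal GEMM overhead.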
I also dumped the SASS code of the volta_sgemm_128x128_nt kernel; the instructions are not very hard to understand. Maybe the SGEMM code is not fully optimized.