Some confusion about TX1 and TX2 FLOPS calculation

Hi all,

I am new to CUDA.

I am currently evaluating the TX1 and TX2 platforms, and I have run into some questions about the performance of these two platforms:

  1. Are there any solid FLOPS figures for the TX2 (FP16 / FP32 / FP64)? I only know that the TX2 has TFLOPS-class capability.

  2. What are the exact numbers of FP16 / FP32 / FP64 cores in the TX1 and TX2? I only know that both of them have 256 CUDA cores.

  3. AFAIK, the FLOPS value is calculated as follows:

     "Number of SMs" * "Number of CUDA cores per SM" * "Peak operating freq. of GPU" * 2 (FMA counts as 2 FLOPs)

  4. The TX1 contains only FP32 and FP64 cores (am I right?), so its FLOPS values would be:

  • FP32: 1 * 256 * 1000 MHz * 2 = 512 GFLOPS
  • FP16: 1 * 512 (FP16 executes as paired half2 operations on the TX1's FP32 cores, hence the doubled count) * 1000 MHz * 2 = 1024 GFLOPS
  • Am I right?
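The arithmetic in the formula above can be sketched as follows (the clock speed and core counts are the figures assumed in this post, not official specifications):

```python
def peak_gflops(num_sms, cores_per_sm, clock_mhz, flops_per_cycle=2):
    """Theoretical peak in GFLOPS: SMs * cores/SM * clock (MHz) * 2 (one FMA = 2 FLOPs)."""
    return num_sms * cores_per_sm * clock_mhz * flops_per_cycle / 1000.0

# TX1 figures assumed above: 256 FP32 cores (counted here as 1 SM * 256) at 1000 MHz
print(peak_gflops(1, 256, 1000))   # FP32: 512.0 GFLOPS
print(peak_gflops(1, 512, 1000))   # FP16 at 2x rate: 1024.0 GFLOPS
```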

    Sorry for so many questions.

    Please help. Thanks.

    Some information is covered in this thread:

    Having said that, you’re probably better off asking TX1 or TX2 questions on the TX1 or TX2 forum:

    Note that making design decisions on FLOPS alone is rarely a good idea. The application may turn out to be (partially) memory bound, especially on a platform that uses fairly low-performance memory like NVIDIA’s integrated platforms. Also, you are unlikely to get really close to the theoretical FLOPS rate in an actual application. For compiled code, 75% may be as good as it gets.

    In general, FLOPS ratings are OK as a disqualifying criterion, but likely insufficient as a qualifying criterion.

    Hello, I am evaluating the TX1's FP16 compute performance, but I cannot reach the theoretical 1 TFLOPS; I only get 0.86 TFLOPS.
    My experimental steps are as follows:
    1. I load the … before running the test code.
    2. The test code uses matrix multiplication to measure compute performance.
    3. I also wrote some simple test programs using FP16 addition and multiplication; none of them reached the theoretical figure.

    What is the cause of this result? Is there any way to reach the theoretical TFLOPS? Thanks.
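    For reference, a throughput microbenchmark along the lines of step 3 usually has to issue long chains of independent half2 FMAs; a single dependent accumulation chain is limited by instruction latency rather than throughput and will always fall short of peak. A minimal sketch (the kernel name, launch configuration, and FLOP accounting are illustrative, not taken from the post):

```cuda
#include <cuda_fp16.h>

// Each __hfma2 performs 2 fused multiply-adds = 4 FP16 FLOPs.
// Four independent accumulators give the scheduler enough ILP to
// keep the FP16 pipeline busy; a single dependent chain would be
// latency-bound and report well below peak.
__global__ void fp16_fma_bench(__half2 *out, int iters)
{
    __half2 a  = __float2half2_rn(1.0001f);
    __half2 b  = __float2half2_rn(0.9999f);
    __half2 c0 = __float2half2_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;

    for (int i = 0; i < iters; ++i) {
        c0 = __hfma2(a, b, c0);
        c1 = __hfma2(a, b, c1);
        c2 = __hfma2(a, b, c2);
        c3 = __hfma2(a, b, c3);
    }
    // Write the result so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}
// FLOPs executed = gridDim.x * blockDim.x * iters * 4 FMA2s * 4 FLOPs each;
// divide by the measured kernel time (e.g. from cudaEvent timers) for FLOPS.
```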

    There is no way to reach the theoretical TFLOPS on any NVIDIA GPU; that's why the word "theoretical" is used prominently. Usually achieving 80-90% of peak is the best you can do.