Some confusion about TX1 and TX2 FLOPS calculation

Hi all,

I am new to CUDA.

I am now evaluating the TX1 and TX2 platforms, and I have run into some questions about their performance:

  1. Are there any solid FLOPS figures for the TX2 (FP16 / FP32 / FP64)? I only know that the TX2 has TFLOPS-level capability.

  2. What is the exact number of FP16 / FP32 / FP64 cores in the TX1 and TX2? I only know that both of them have 256 CUDA cores.

  3. AFAIK, the FLOPS value is calculated as follows:

     "Number of SMs" * "Number of CUDA cores per SM" * "Peak operating freq. of GPU" * 2 (FFMA)

  4. The TX1 contains only FP32 cores and FP64 cores (am I right?), and its FLOPS are:

  • FP32: 1 * 256 * 1000 MHz * 2 = 512 GFLOPS
  • FP16: 1 * 512 (FP16 is emulated by FP32 cores in TX1) * 1000 MHz * 2 = 1024 GFLOPS

  Am I right?

    Sorry for so many questions.

    Please help. Thanks.
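
The peak-FLOPS formula in the question can be written out as a few lines of Python. This is only a sketch of that arithmetic, using the SM count, core count, and 1000 MHz clock exactly as given in the post; the function name `peak_gflops` and its parameters are made up for illustration, and none of the inputs are confirmed hardware specifications.

```python
def peak_gflops(num_sms, cores_per_sm, clock_mhz, flops_per_core_per_cycle=2):
    """Theoretical peak in GFLOPS.

    An FMA (FFMA) counts as 2 floating-point operations per core per cycle,
    which is where the factor of 2 in the formula comes from.
    """
    # cores * MHz gives millions of core-cycles per second; /1000 converts to G
    return num_sms * cores_per_sm * clock_mhz * flops_per_core_per_cycle / 1000.0


# Inputs as stated in the post (assumptions, not verified specs):
fp32 = peak_gflops(num_sms=1, cores_per_sm=256, clock_mhz=1000)
# The post's FP16 estimate treats each FP32 core as two FP16 lanes (512 "cores")
fp16 = peak_gflops(num_sms=1, cores_per_sm=512, clock_mhz=1000)

print(f"FP32 peak: {fp32:.0f} GFLOPS")
print(f"FP16 peak: {fp16:.0f} GFLOPS")
```

Because the formula is a plain product, only the total number of cores matters; how the 256 cores are divided among SMs does not change the result.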

    Some information is covered in this thread:

    https://devtalk.nvidia.com/default/topic/1024825/cuda-programming-and-performance/jetson-tx2-performance/

    Having said that, you’re probably better off asking TX1 or TX2 questions on the TX1 or TX2 forum:

    https://devtalk.nvidia.com/default/board/139/jetson-embedded-systems/

    Note that making design decisions on FLOPS alone is rarely a good idea. The application may turn out to be (partially) memory bound, especially on a platform that uses fairly low-performance memory like NVIDIA’s integrated platforms. Also, you are unlikely to get really close to the theoretical FLOPS rate in an actual application. For compiled code, 75% may be as good as it gets.

    In general, FLOPS ratings are OK as a disqualifying criterion, but likely insufficient as a qualifying criterion.
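
One common way to reason about the memory-bound case mentioned above is a roofline-style estimate: attainable throughput is the smaller of the compute peak and memory bandwidth times the kernel's arithmetic intensity. The sketch below is not from the thread. The 25.6 GB/s bandwidth is an assumed placeholder to be replaced with the figure from the module's datasheet, and the function name is illustrative.

```python
def attainable_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """Roofline-style upper bound: limited by compute or by memory traffic."""
    return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)


PEAK_FP32_GFLOPS = 512.0   # FP32 figure computed in the question
MEM_BANDWIDTH_GBS = 25.6   # ASSUMED value for illustration; check the datasheet

# SAXPY-like kernel (y = a*x + y) on 4-byte floats:
# 2 flops per element, 12 bytes moved (read x, read y, write y)
saxpy_intensity = 2 / 12

bound = attainable_gflops(PEAK_FP32_GFLOPS, MEM_BANDWIDTH_GBS, saxpy_intensity)
print(f"SAXPY-like bound: {bound:.2f} GFLOPS of {PEAK_FP32_GFLOPS:.0f} GFLOPS peak")
```

Under these assumptions a streaming kernel is capped far below the compute peak, which is why a FLOPS rating alone says little about such workloads.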

    Hello, I am now evaluating the FP16 compute performance of the TX1, but I can't reach the theoretical 1 TFLOPS.
    I only get 0.86 TFLOPS.
    My experimental steps are as follows:
    1. I run jetson_clocks.sh before running the test code.
    2. The test code is below; it uses matrix multiplication to measure compute performance:
    https://github.com/hma02/cublasHgemm-P100
    3. I also wrote some simple test programs using FP16 addition and multiplication. They did not reach the theoretical figure either.

    What is the cause of this result? Is there any way to reach the theoretical TFLOPS? Thanks.

    There is no way to get the theoretical TFLOPS on any NVIDIA GPU. That's why the word “theoretical” is used prominently. Usually getting 80-90% is the best you can do.
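
To put the 0.86 TFLOPS measurement in context, a matrix-multiplication benchmark typically counts 2*M*N*K floating-point operations per GEMM and divides by the elapsed time. The sketch below shows that conversion and the resulting efficiency against the 1024 GFLOPS FP16 peak computed earlier in the thread. The function names are illustrative, and this is not the code of the linked benchmark.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOPS for one (m x k) * (k x n) matrix multiplication.

    Each output element needs k multiply-adds, counted as 2 operations each.
    """
    return 2.0 * m * n * k / seconds / 1e9


def efficiency(achieved_gflops, peak_gflops):
    """Fraction of the theoretical peak that was actually reached."""
    return achieved_gflops / peak_gflops


# Figures quoted in the thread: 0.86 TFLOPS measured, 1024 GFLOPS FP16 peak
ratio = efficiency(860.0, 1024.0)
print(f"Efficiency: {ratio:.0%}")
```

With the numbers from the thread this comes out at roughly 84% of peak, which sits inside the 80-90% range described in the reply above.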