TX2i GPU compute capability test


We are using the TX2i and want to test the GPU compute capability. Using matrixMulCUBLAS, the measured result is only 0.5 TFLOPS, but the theoretical value is 1.26 TFLOPS. Why is the measured value so different from the theoretical value?

We now want to test the GPU compute capability in units of TOPS. Is that OK?

If we have any problems in testing, please provide us with the correct method and testing tools.

Also, when we use lscpu to check the CPU parameters, the reported data does not match the 4+2 core configuration. Why?


Have you maximized the device’s performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

You can find some benchmark samples in the GitHub repository below:

Also, TX2 doesn’t support INT8 operations.
INT8 is supported on GPU architectures 7.x and later, such as Xavier or Orin.


Thanks for your reply

In the previous test, we have set the maximum performance (e.g., maximum power consumption, fans keep running), but the test value is only 0.55TFLOPS at most, far below the official figure of 1.26TFLOPS.

We would like to know what methods and tools NVIDIA used to measure 1.26 TFLOPS. The acceptance criterion for our project is GPU computing power greater than 1 TFLOPS. If this is not reached, our project will fail acceptance, which has a great impact. Our schedule is very tight, so please reply as soon as possible. Our testing methods and tools are as follows:
On the TX2i we currently use the sample at /usr/local/cuda/samples/0_Simple/matrixMulCUBLAS, and the highest result is 0.55 TFLOPS.
The benchmark test results are as follows:

In addition, the difference in measured GPU compute capability with and without maximum performance mode is only 0.1 TFLOPS.


Could you share the flops.py with us so we can check?

We are using a matrix multiplication method for testing. Can you validate this method on your platform? If the result is still 0.55 TFLOPS, then since our project specification requires a compute capability result greater than 1 TFLOPS, please give us a test methodology that can measure compute capability values greater than 1 TFLOPS.

Attached is the sample we use to test the GPU compute capability:
matrixMulCUBLAS.7z (133.0 KB)

  • The following example uses the PyTorch benchmark utility

import torch
from torch.utils import benchmark

typ = torch.float16  # data precision
n = 1024 * 16
# Two n x n half-precision matrices on the GPU
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})

x = t.timeit(50)
# 2*n**3 floating-point operations per product, reported in TFLOPS
print(2*n**3 / x.median / 1e12)
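
The 2*n**3 in the last line counts one multiply and one add for each of the n³ inner-product terms of an n × n matrix product. A minimal sketch of the same conversion for general shapes follows; the function name matmul_tflops is hypothetical and not part of PyTorch, and the timing used in the example call is illustrative, not a measurement.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an (m x k) @ (k x n) product that took `seconds`.

    Hypothetical helper: each of the m*n outputs needs k multiplies and
    k adds, so the product is counted as 2*m*n*k floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2 * m * n * k / seconds / 1e12


# Illustrative timing only, not a measurement: a 16384^3 product in 16 s.
n = 1024 * 16
print(f"{matmul_tflops(n, n, n, 16.0):.3f} TFLOPS")
```

With a real run, `seconds` would be the median time per product reported by the timer.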


Just want to confirm first.
Do you use matrixMulCUBLAS or the PyTorch sample for the benchmark?

Since PyTorch is a third-party library, it’s not recommended for testing the device’s maximum performance.


We use matrixMulCUBLAS

Below is the test code for matrixMulCUBLAS
matrixMulCUBLAS.7z (133.0 KB)

We would like to know what you use for testing and how you test. Can the compute capability exceed 1 TFLOPS?


We will try your source and share more info with you later.


Please try setting --sizemult=[value].
This flag can increase the GPU job complexity in cuBLAS.

The default value is too low to show the performance.
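
One way to see how the result changes with problem size is to run the sample at several multipliers. The sketch below relies only on what is stated above, that the sample accepts --sizemult=[value]. The binary path, the list of multipliers, and the helper name sweep_commands are assumptions for illustration.

```python
import os
import subprocess

# Assumed location of the built sample; adjust to wherever it was compiled.
BINARY = "/usr/local/cuda/samples/0_Simple/matrixMulCUBLAS/matrixMulCUBLAS"


def sweep_commands(binary, multipliers):
    """Return one argument list per size multiplier (hypothetical helper)."""
    return [[binary, f"--sizemult={m}"] for m in multipliers]


if __name__ == "__main__":
    for cmd in sweep_commands(BINARY, [1, 2, 4, 8]):
        if not os.path.exists(cmd[0]):
            print("sample binary not found:", cmd[0])
            break
        # The sample prints its own performance figure; show it unmodified.
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        print(" ".join(cmd))
        print(result.stdout)
```

Running each multiplier after nvpmodel and jetson_clocks have been applied keeps the runs comparable.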


Thanks for the reply.
We would like to know what you use for testing and how you test. Can the compute capability exceed 1 TFLOPS? Can you adjust sizemult to get a result above 1 TFLOPS?

The results we measured with matrixMulCUBLAS while adjusting sizemult are shown below.


The spec is measured with FP16.
But matrixMulCUBLAS runs with FP32.
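
This difference accounts for most of the gap. As a rough check, assume the GPU's peak FP16 rate is twice its peak FP32 rate (an assumption about the TX2i's Pascal GPU, not something stated in the spec quoted in this thread). The FP32 ceiling, and the fraction of it reached by the 0.55 TFLOPS measurement, then work out as follows:

```python
SPEC_FP16_TFLOPS = 1.26      # figure quoted from the spec in this thread
MEASURED_FP32_TFLOPS = 0.55  # matrixMulCUBLAS result reported in this thread
FP16_TO_FP32_RATIO = 2.0     # ASSUMPTION: FP16 peak is 2x the FP32 peak

# Implied FP32 ceiling under the assumed ratio.
fp32_peak = SPEC_FP16_TFLOPS / FP16_TO_FP32_RATIO

# Fraction of that ceiling reached by the FP32 measurement.
efficiency = MEASURED_FP32_TFLOPS / fp32_peak

print(f"implied FP32 peak: {fp32_peak:.2f} TFLOPS")
print(f"measured / peak:   {efficiency:.0%}")
```

Under that assumption, an FP32 benchmark cannot reach 1 TFLOPS on this device, so a result above 1 TFLOPS would have to come from an FP16 workload.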


I would like to know whether the 1.26 TFLOPS compute performance for the TX2i listed on NVIDIA’s website is a measured value or a theoretical value. If it is a measured value: we have been trying various methods for more than 2 months, and our project requires a measured value greater than 1 TFLOPS, but across many tests on the TX2i the maximum is only about 0.55 TFLOPS. Please provide us with the official, correct test method so that we can finish the test as soon as possible.
The previous reply says that matrixMulCUBLAS runs with FP32. When we modify the matrixMulCUBLAS code to use FP16, it reports that FP16 is not supported. Can it be modified to run with FP16? If so, please provide a detailed description of the changes. If part of the system needs to be upgraded, please provide the specific content and method.
If the system does not need to be upgraded, please provide detailed test methods and tools for FP16.


We need to double-check with our internal team.
We will share more information with you later.



The score is the AI performance in FP16 mode.
Please check the jetson_benchmarks sample shared on Aug 30.


We are trying the method from the August 30 reply, but the online installation has not succeeded. Can you provide an offline installer?
We also have a question: the unit reported by jetson_benchmarks is FPS. Is there any documentation on the conversion between FPS and TFLOPS?


What kind of error did you encounter when installing the dependencies?
You can get the TFLOPS with our profiler.


I was not able to run the test in the jetson_benchmarks environment. Below is a description of my testing process:
Attached is the jetson_benchmarks environment I downloaded from GitHub. I only added the models folder, which stores the files downloaded by jetson_benchmarks, and the file jetson_benchmarks-master\test in the root directory, which records the output printed during the run. Nothing was downloaded successfully. Please help analyze the problem and suggest a solution. Thanks!

If the test using jetson_benchmarks succeeds, how do we get the TFLOPS value using the NVIDIA profiler?
jetson_benchmarks-master-test.zip (25.1 KB)


Do you encounter any errors when downloading?
The topic below contains some information that can help you set up the environment:

The profiler can report some metrics related to TFLOPS.
For example:

$ sudo /usr/local/cuda/bin/nvprof --metrics inst_fp_32 /usr/local/cuda/samples/0_Simple/vectorAdd/vectorAdd
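
inst_fp_32 is an instruction count, so turning it into a throughput also needs the kernel's execution time, ideally taken from a separate nvprof run without --metrics, because metric collection slows the kernel down. A minimal sketch of the arithmetic follows, assuming each counted instruction performs either one operation (add or multiply) or two (fused multiply-add). The function name and the input numbers are hypothetical placeholders, not profiler output.

```python
def fp32_tflops_bounds(inst_fp_32: int, kernel_seconds: float):
    """Lower and upper bound on FP32 TFLOPS from an instruction count.

    Hypothetical helper. Lower bound: every instruction is a single add or
    multiply (1 FLOP). Upper bound: every instruction is a fused
    multiply-add (2 FLOPs).
    """
    if kernel_seconds <= 0:
        raise ValueError("kernel_seconds must be positive")
    low = inst_fp_32 / kernel_seconds / 1e12
    return low, 2 * low


# Placeholder inputs for illustration only.
low, high = fp32_tflops_bounds(inst_fp_32=1_000_000_000, kernel_seconds=0.01)
print(f"between {low:.2f} and {high:.2f} TFLOPS")
```

For a matrix multiplication kernel, where nearly all floating-point instructions are multiply-adds, the upper bound is the closer estimate.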