TX2 Computing Performance has Dropped

Hello everyone,
I have a Jetson TX2 with JetPack 4.2 installed. Re-running some code developed over the last year, I observed that performance has roughly halved, for both the CPU and GPU parts, and especially for generic memory copies.

nvpmodel -m 0 and jetson_clocks were called before each test.

Since nothing has changed in the configuration, I checked the actual GPU and memory clock speeds of the device and found that they appear to be capped.

The analysis report produced by Nsight Eclipse Edition last year reported:

Half Precision FLOP/s - 665.856 GigaFLOP/s
Single Precision FLOP/s - 665.856 GigaFLOP/s
Double Precision FLOP/s - 20.808 GigaFLOP/s
Multiprocessor Clock Rate - 1.3 GHz

whereas now it reports lower numbers:

Half Precision FLOP/s - 522.24 GigaFLOP/s
Single Precision FLOP/s - 522.24 GigaFLOP/s
Double Precision FLOP/s - 16.32 GigaFLOP/s
Multiprocessor Clock Rate - 1.02 GHz

This is what cudaGetDeviceProperties returns (with the bandwidth calculated from the other fields, as sketched below):

Device name: NVIDIA Tegra X2
Memory Clock Rate (KHz): 1300000
Memory Bus Width (bits): 128
Peak Memory Bandwidth (GB/s): 41.600000
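
For reference, this is a minimal sketch of the query I use to get those numbers; the calculation assumes the clock is reported in kHz and that the LPDDR4 memory transfers data twice per clock (DDR):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device name: %s\n", prop.name);
    printf("Memory Clock Rate (KHz): %d\n", prop.memoryClockRate);
    printf("Memory Bus Width (bits): %d\n", prop.memoryBusWidth);
    // kHz -> Hz, x2 for DDR, bus width in bits -> bytes, then scale to GB/s
    double peakGBs = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Peak Memory Bandwidth (GB/s): %f\n", peakGBs);
    return 0;
}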

Specification says that the memory bandwidth should be 59.7 GB/s. How can I make sure the Jetson runs at peak performance again? Could it be a power-related problem, perhaps not enough power from the power supply?

Thanks

Hi jacopo.mocci.

Do you mean the test was run on the same JetPack 4.2, or on a different JetPack version?

You may also try another power source if you suspect there might not be enough power.

Everything was left unchanged.

I’m using the developer board with the power supply it shipped with, which should provide 80 W. No additional devices that could draw power are attached to the board; only the built-in WiFi is in use.

I don’t have access to other power supplies at the moment. Is there a way to troubleshoot the power supply from the Jetson itself? Could it be something else, like thermal throttling?

Hi,

Just want to clarify first.
You are using the same board, the same JetPack, and the same app,
but the recent performance is lower than last year's score. Is that correct?

If yes, did you reflash your device or just reuse last year's environment?
By the way, did you maximize the performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
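
As a quick sanity check after running those commands, you could also read back the peak SM clock the CUDA runtime reports (a minimal sketch, not an official tool; note that it shows the peak clock, not the live DVFS state). On TX2 it should show roughly 1.3 GHz:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int clockKHz = 0;
    // cudaDevAttrClockRate is the peak SM clock frequency in kHz
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, 0);
    printf("Reported SM clock: %.2f GHz\n", clockKHz / 1e6);
    return 0;
}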

Thanks.

AastaLLLL,
That is correct; the experimental conditions are the same. Even the host computer is the same (a VM that is used only to cross-compile).

I maximize performance with the aforementioned commands, in that exact order.

Hi,

Would you mind rebooting the system and profiling the device again to see if there is any difference?
Thanks.

Hi,
There are no differences after rebooting.

Regarding profiling, is there a test routine I can use to compare my TX2's performance against its intended performance, just to make sure I'm not messing up any of the steps?

Hi,

You can find some profiling tools (for example, deviceQuery and bandwidthTest) in our CUDA samples directory.

/usr/local/cuda-10.2/samples/1_Utilities/

Since there have been several releases recently, it's recommended to reflash your device with the latest version first.

Thanks.

I have now started from scratch, flashing the TX2 with JetPack 4.4 and completely reinstalling the host with Ubuntu 18.04.

deviceQuery gives me:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “NVIDIA Tegra X2”
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7860 MBytes (8241651712 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

bandwidthTest --memory=pageable:

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA Tegra X2
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 2.7

Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 2.2

Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 32.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

bandwidthTest --memory=pinned:

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA Tegra X2
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 20.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 20.5

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 32.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

UnifiedMemoryPerf:

GPU Device 0: “Pascal” with compute capability 6.2

Running …

Overall Time For matrixMultiplyPerf

Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.148 0.346 0.102 0.172 0.149 0.127 0.201 0.193
16 0.213 0.715 0.214 0.645 0.196 0.173 0.777 0.661
64 0.616 0.815 0.675 1.681 0.760 0.630 1.789 1.704
256 1.588 2.152 1.915 6.058 2.230 1.683 6.179 6.086
1024 6.814 7.024 7.967 24.753 7.517 7.159 25.087 24.841
4096 37.942 38.367 42.500 110.330 40.212 40.111 110.345 110.053
16384 240.237 241.042 262.994 540.050 258.708 258.991 539.630 539.532

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

tegrastats output (with no other processes running):

RAM 1104/7860MB (lfb 1380x4MB) SWAP 0/3930MB (cached 0MB) CPU [1%@2035,0%@2035,0%@2035,0%@2035,0%@2035,0%@2035] EMC_FREQ 0% GR3D_FREQ 0% PLL@32.5C MCPU@32.5C PMIC@100C Tboard@27C GPU@29.5C BCPU@32.5C thermal@31.2C Tdiode@28.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 843/843 VDD_4V0_WIFI 38/50 VDD_IN 2872/2855 VDD_SYS_CPU 229/254 VDD_SYS_DDR 1075/1075

I always call nvpmodel -m 0 and jetson_clocks before doing anything on the board.

One thing I noticed is that the memory clock rate is reported as 1300 MHz instead of its maximum of 1866 MHz. Maybe this is part of the problem. How can I set it back to its maximum?
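
If I plug the two clocks into the same bandwidth formula as above, the numbers line up with what I'm seeing:

1866 MHz × 2 (DDR) × 128/8 bytes = 59.7 GB/s (the specified bandwidth)
1300 MHz × 2 (DDR) × 128/8 bytes = 41.6 GB/s (what cudaGetDeviceProperties reports)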

Hi,

It looks like the memory clock was also reported as 1300 MHz on JetPack 4.2 (from the first comment).
Is this correct?

Thanks.

There has been no update from you for a while, so we are assuming this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one. Thanks.