I have now started from scratch: I flashed the TX2 with JetPack 4.4 and completely reinstalled the host with Ubuntu 18.04.
deviceQuery gives me:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7860 MBytes (8241651712 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
bandwidthTest --memory=pageable:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA Tegra X2
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 2.7
Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 2.2
Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 32.9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
bandwidthTest --memory=pinned:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA Tegra X2
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 20.4
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 20.5
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 32.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
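To put the pageable and pinned runs side by side, here is a minimal Python sketch that divides the pinned bandwidth by the pageable bandwidth for each direction. The numbers are copied from the two outputs above; the dictionary keys and the helper name `pinned_over_pageable` are only illustrative.

```python
# Bandwidth figures in GB/s, copied from the two bandwidthTest runs above
# (32000000-byte transfers, quick mode).
pageable = {"H2D": 2.7, "D2H": 2.2, "D2D": 32.9}
pinned = {"H2D": 20.4, "D2H": 20.5, "D2D": 32.3}


def pinned_over_pageable(direction):
    """Return how many times faster the pinned run was for one direction."""
    return pinned[direction] / pageable[direction]


ratios = {d: round(pinned_over_pageable(d), 2) for d in pageable}

for direction, ratio in ratios.items():
    print(f"{direction}: pinned is {ratio}x the pageable bandwidth")
```

With these figures, pinned transfers are roughly 7.6x (host to device) and 9.3x (device to host) faster than pageable ones, while device-to-device bandwidth is essentially unchanged.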
UnifiedMemoryPerf:
GPU Device 0: "Pascal" with compute capability 6.2
Running …
Overall Time For matrixMultiplyPerf
Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.148 0.346 0.102 0.172 0.149 0.127 0.201 0.193
16 0.213 0.715 0.214 0.645 0.196 0.173 0.777 0.661
64 0.616 0.815 0.675 1.681 0.760 0.630 1.789 1.704
256 1.588 2.152 1.915 6.058 2.230 1.683 6.179 6.086
1024 6.814 7.024 7.967 24.753 7.517 7.159 25.087 24.841
4096 37.942 38.367 42.500 110.330 40.212 40.111 110.345 110.053
16384 240.237 241.042 262.994 540.050 258.708 258.991 539.630 539.532
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
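The table above is easier to compare with a few lines of Python. This is just a sketch that parses the pasted rows, picks the fastest variant per size, and computes how much slower the 0Copy column is than MemCopy. The helper names `parse_table` and `fastest_variant` are made up for illustration; the data is exactly what UnifiedMemoryPerf printed.

```python
# UnifiedMemoryPerf output pasted verbatim (average of 20 measurements, in ms).
TABLE = """\
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.148 0.346 0.102 0.172 0.149 0.127 0.201 0.193
16 0.213 0.715 0.214 0.645 0.196 0.173 0.777 0.661
64 0.616 0.815 0.675 1.681 0.760 0.630 1.789 1.704
256 1.588 2.152 1.915 6.058 2.230 1.683 6.179 6.086
1024 6.814 7.024 7.967 24.753 7.517 7.159 25.087 24.841
4096 37.942 38.367 42.500 110.330 40.212 40.111 110.345 110.053
16384 240.237 241.042 262.994 540.050 258.708 258.991 539.630 539.532
"""


def parse_table(text):
    """Map each size in KB to a {variant name: time in ms} dictionary."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return {
        int(row[0]): dict(zip(header[1:], map(float, row[1:])))
        for row in rows
    }


def fastest_variant(timings):
    """Return the name of the variant with the lowest time."""
    return min(timings, key=timings.get)


results = parse_table(TABLE)

for size_kb, timings in results.items():
    ratio = timings["0Copy"] / timings["MemCopy"]
    print(f"{size_kb:>6} KB  fastest={fastest_variant(timings):<8} "
          f"0Copy/MemCopy={ratio:.2f}")
```

On this data the zero-copy and page-locked variants (0Copy, CpHpglk, CpPglAs) take roughly 2x to 3x as long as MemCopy from 16 KB upward, and UMhint is the fastest variant from 64 KB upward.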
tegrastats output (with no other process running):
RAM 1104/7860MB (lfb 1380x4MB) SWAP 0/3930MB (cached 0MB) CPU [1%@2035,0%@2035,0%@2035,0%@2035,0%@2035,0%@2035] EMC_FREQ 0% GR3D_FREQ 0% PLL@32.5C MCPU@32.5C PMIC@100C Tboard@27C GPU@29.5C BCPU@32.5C thermal@31.2C Tdiode@28.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 843/843 VDD_4V0_WIFI 38/50 VDD_IN 2872/2855 VDD_SYS_CPU 229/254 VDD_SYS_DDR 1075/1075
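For anyone who wants to pull numbers out of a tegrastats line like the one above, here is a small Python sketch using regular expressions. The function name `parse_tegrastats` and the dictionary keys are my own invention, and I am assuming the power rail fields are current/average values in mW; the patterns are only fitted to this one sample line and may need adjusting for other JetPack releases.

```python
import re

# The tegrastats line pasted above, verbatim.
LINE = (
    "RAM 1104/7860MB (lfb 1380x4MB) SWAP 0/3930MB (cached 0MB) "
    "CPU [1%@2035,0%@2035,0%@2035,0%@2035,0%@2035,0%@2035] "
    "EMC_FREQ 0% GR3D_FREQ 0% PLL@32.5C MCPU@32.5C PMIC@100C Tboard@27C "
    "GPU@29.5C BCPU@32.5C thermal@31.2C Tdiode@28.25C "
    "VDD_SYS_GPU 229/229 VDD_SYS_SOC 843/843 VDD_4V0_WIFI 38/50 "
    "VDD_IN 2872/2855 VDD_SYS_CPU 229/254 VDD_SYS_DDR 1075/1075"
)


def parse_tegrastats(line):
    """Extract RAM usage, GPU load, temperatures and power rails from one line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    gr3d = re.search(r"GR3D_FREQ (\d+)%", line)
    # Temperature fields look like NAME@<number>C; the CPU entries end in a
    # frequency without a trailing C, so they are not picked up here.
    temps = {name: float(value)
             for name, value in re.findall(r"(\w+)@([\d.]+)C", line)}
    # Assumption: each rail is printed as current/average.
    power = {name: (int(cur), int(avg))
             for name, cur, avg in re.findall(r"(VDD_\w+) (\d+)/(\d+)", line)}
    return {
        "ram_used_mb": int(ram.group(1)),
        "ram_total_mb": int(ram.group(2)),
        "gr3d_percent": int(gr3d.group(1)),
        "temps_c": temps,
        "power": power,
    }


stats = parse_tegrastats(LINE)
print(stats["ram_used_mb"], "of", stats["ram_total_mb"], "MB RAM in use")
print("GPU load:", stats["gr3d_percent"], "%")
print("GPU temperature:", stats["temps_c"]["GPU"], "C")
```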
I always call nvpmodel -m 0 and jetson_clocks before doing anything on the board.