I’ve profiled my multi-threaded application on both boards. The GPU kernels are faster on the Xavier than on the TX2, but overall the application runs about 30-50 ms slower on the Xavier. Breaking the work down by task, I can see that the CPU-heavy tasks are often the slower ones.
For example, I have to poll a driver to see when a task on an external FPGA has completed. The FPGA takes the same amount of time regardless of which NVIDIA board I am using, so I surmise that the slowdown is due to how Linux schedules the polling wake-up.
A second example is the PCIe transfer time, which is also slower on the Xavier than on the TX2, even though we went from x4 to x8 lanes. The driver uses a copy_from_user call, which depends strictly on CPU performance. Since the PCIe transfer itself should be physically faster, the slowdown must be on the CPU side.
Lastly, I have some large data-manipulation loops that run on the CPU, and these are slower too, albeit by only a few milliseconds. The worst offender does a lot of memory allocation.
I have everything at max power and frequency. My application only uses 50% of the RAM available.
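One way to test the wake-up theory in isolation is to time how far a short sleep overshoots its requested duration on each board. The following is a minimal sketch, not the application's code: the 1 ms interval, the iteration count, and the function names `measure_wakeup_overshoot` and `summarize` are assumptions chosen for illustration, and it uses Python's `time.sleep` as a stand-in for the real driver poll.

```python
import statistics
import time


def measure_wakeup_overshoot(interval_s=0.001, iterations=200):
    """Sleep repeatedly and return how late each wake-up was, in microseconds."""
    overshoots_us = []
    for _ in range(iterations):
        start = time.perf_counter()
        time.sleep(interval_s)  # stand-in for the blocking poll on the driver
        elapsed = time.perf_counter() - start
        overshoots_us.append((elapsed - interval_s) * 1e6)
    return overshoots_us


def summarize(samples):
    """Return min, average, approximate 95th percentile, and max of the samples."""
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "min": ordered[0],
        "avg": statistics.mean(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    for name, value in summarize(measure_wakeup_overshoot()).items():
        print(f"{name}: {value:.1f} us")
```

Running the same script on the TX2 and the Xavier, pinned to a single core with `taskset -c <core>`, would show whether the timer wake-up itself is slower, independent of the FPGA and the driver.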
Here is the TX2:
root@tegra-ubuntu:/opt/logostech# ifconfig
eth0 Link encap:Ethernet HWaddr 00:04:4b:a7:f7:91
inet addr:172.17.1.138 Bcast:172.17.1.255 Mask:255.255.255.0
inet6 addr: fe80::abf0:6186:57ca:2c82/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:24167643 errors:0 dropped:0 overruns:0 frame:0
TX packets:15600 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2973211419 (2.9 GB) TX bytes:3273322 (3.2 MB)
Interrupt:42
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/tegrastats
RAM 1626/6848MB (lfb 984x4MB) CPU [0%@2034,0%@2035,0%@2034,0%@2035,0%@2034,0%@2034] EMC_FREQ 0%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 1% bg 13% BCPU@31C MCPU@31C GPU@29.5C PLL@31C Tboard@26C Tdiode@27.5C PMIC@100C thermal@30.4C VDD_IN 3548/3548 VDD_CPU 381/381 VDD_GPU 152/152 VDD_SOC 762/762 VDD_WIFI 0/0 VDD_DDR 1080/1080
RAM 1627/6848MB (lfb 984x4MB) CPU [0%@2035,0%@2034,0%@2035,0%@2036,0%@2035,0%@2034] EMC_FREQ 0%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@31C MCPU@31C GPU@29.5C PLL@31C Tboard@26C Tdiode@27.5C PMIC@100C thermal@30.4C VDD_IN 3472/3510 VDD_CPU 381/381 VDD_GPU 152/152 VDD_SOC 762/762 VDD_WIFI 0/0 VDD_DDR 1080/1080
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/jetson_clocks.sh --show
SOC family:tegra186 Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu2: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu3: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=1300500000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
Fan: speed=255
root@tegra-ubuntu:/opt/logostech# cat /proc/cpuinfo
processor : 0
model name : ARMv8 Processor rev 3 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 3
processor : 1
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x003
CPU revision : 0
MTS version : 40418221
processor : 2
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x003
CPU revision : 0
MTS version : 40418221
processor : 3
model name : ARMv8 Processor rev 3 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 3
processor : 4
model name : ARMv8 Processor rev 3 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 3
processor : 5
model name : ARMv8 Processor rev 3 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 3
root@tegra-ubuntu:/opt/logostech# sysbench --test=cpu run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing CPU performance benchmark
Threads started!
Done.
Maximum prime number checked in CPU test: 10000
Test execution summary:
total time: 5.0561s
total number of events: 10000
total time taken by event execution: 5.0538
per-request statistics:
min: 0.50ms
avg: 0.51ms
max: 0.68ms
approx. 95 percentile: 0.52ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 5.0538/0.00
root@tegra-ubuntu:/opt/logostech#
And here is the Xavier:
root@tegra-ubuntu:/opt/logostech# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.1.171 netmask 255.255.255.0 broadcast 172.17.1.255
inet6 fe80::849e:40b2:212f:3c89 prefixlen 64 scopeid 0x20<link>
ether 00:04:4b:cb:90:07 txqueuelen 1000 (Ethernet)
RX packets 54373888 bytes 6390622233 (6.3 GB)
RX errors 0 dropped 14 overruns 0 frame 0
TX packets 162813 bytes 19834536 (19.8 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 40
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/tegrastats
RAM 3974/15039MB (lfb 2190x4MB) CPU [0%@2255,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2224,0%@2265] EMC_FREQ 0%@2133 GR3D_FREQ 0%@1377 APE 150 MTS fg 0% bg 0% AO@26.5C GPU@28C Tboard@27C Tdiode@28.75C AUX@28C CPU@28.5C thermal@28.15C PMIC@100C GPU 1232/1232 CPU 462/462 SOC 2464/2464 CV 0/0 VDDRQ 0/0 SYS5V 3416/3416
RAM 3974/15039MB (lfb 2190x4MB) CPU [0%@2235,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265] EMC_FREQ 0%@2133 GR3D_FREQ 0%@1377 APE 150 MTS fg 0% bg 0% AO@26C GPU@28C Tboard@27C Tdiode@29C AUX@28C CPU@28.5C thermal@28.15C PMIC@100C GPU 1232/1232 CPU 462/462 SOC 2464/2464 CV 0/0 VDDRQ 0/0 SYS5V 3416/3416
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/jetson_clocks.sh --show
SOC family:tegra194 Machine:jetson-xavier
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu1: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu2: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu3: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu4: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu5: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu6: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu7: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
root@tegra-ubuntu:/opt/logostech# nvpmodel -q
NV Power Mode: MAXN
0
root@tegra-ubuntu:/opt/logostech# cat /proc/cpuinfo
processor : 0
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 1
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 2
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 3
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 4
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 5
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 6
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
processor : 7
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 43226549
root@tegra-ubuntu:/opt/logostech# sysbench --test=cpu --events=10000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 10000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 1929.09
General statistics:
total time: 5.1806s
total number of events: 10000
Latency (ms):
min: 0.50
avg: 0.52
max: 1.63
95th percentile: 0.54
sum: 5172.01
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 5.1720/0.00
root@tegra-ubuntu:/opt/logostech#
EDIT: I realized that the Ubuntu 18 version of the tool (sysbench 1.0.11) was not stopping at 10000 events. I added --events=10000, and now the comparison is apples to apples. The Xavier is still a bit slower, now by about 100 ms, which is closer to what I usually see in my application.
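To put numbers on that comparison, here is a small sketch that works only from the totals and clock frequencies printed in the CPU runs above. The constant and function names are made up for illustration. The cycles-per-event figure assumes each core really ran at the nominal frequency reported by jetson_clocks.sh, and it cannot tell which TX2 core (Denver or A57) executed the single sysbench thread.

```python
# Figures copied from the sysbench CPU runs and jetson_clocks.sh output above.
EVENTS = 10000
TX2_TOTAL_S = 5.0561
XAVIER_TOTAL_S = 5.1806
TX2_CLOCK_KHZ = 2035200
XAVIER_CLOCK_KHZ = 2265600


def per_event_ms(total_s, events=EVENTS):
    """Average wall-clock time per sysbench event, in milliseconds."""
    return total_s * 1000.0 / events


def cycles_per_event(total_s, clock_khz, events=EVENTS):
    """Approximate CPU cycles per event, assuming the nominal clock was held."""
    return total_s * clock_khz * 1000.0 / events


delta_ms = (XAVIER_TOTAL_S - TX2_TOTAL_S) * 1000.0
slowdown_pct = (XAVIER_TOTAL_S / TX2_TOTAL_S - 1.0) * 100.0
cycle_ratio = (cycles_per_event(XAVIER_TOTAL_S, XAVIER_CLOCK_KHZ)
               / cycles_per_event(TX2_TOTAL_S, TX2_CLOCK_KHZ))

print(f"TX2 per event:    {per_event_ms(TX2_TOTAL_S):.4f} ms")
print(f"Xavier per event: {per_event_ms(XAVIER_TOTAL_S):.4f} ms")
print(f"Xavier slower by: {delta_ms:.1f} ms ({slowdown_pct:.2f}%)")
print(f"Cycles per event, Xavier / TX2: {cycle_ratio:.3f}")
```

The wall-clock gap is about 124.5 ms, roughly 2.5%. Because the Xavier cores are clocked about 11% higher, the gap per clock cycle on this one narrow benchmark is larger than the wall-clock figure suggests.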
root@tegra-ubuntu:/opt/logostech# sysbench --num-threads=16 --test=fileio --file-total-size=3G --file-test-mode=rndrw --max-requests=10000 run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 16
Extra file open flags: 0
128 files, 24Mb each
3Gb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Done.
Operations performed: 6006 Read, 4005 Write, 12806 Other = 22817 Total
Read 93.844Mb Written 62.578Mb Total transferred 156.42Mb (66.189Mb/sec)
4236.11 Requests/sec executed
Test execution summary:
total time: 2.3633s
total number of events: 10011
total time taken by event execution: 0.2190
per-request statistics:
min: 0.01ms
avg: 0.02ms
max: 4.25ms
approx. 95 percentile: 0.02ms
Threads fairness:
events (avg/stddev): 625.6875/113.04
execution time (avg/stddev): 0.0137/0.00
Xavier:
root@tegra-ubuntu:/opt/logostech# sysbench --threads=16 fileio --file-total-size=3G --file-test-mode=rndrw --events=10000 run
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Initializing random number generator from current time
Extra file open flags: 0
128 files, 24MiB each
3GiB total file size
Block size 16KiB
Number of IO requests: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...
Threads started!
File operations:
reads/s: 2520.58
writes/s: 1680.39
fsyncs/s: 5377.23
Throughput:
read, MiB/s: 39.38
written, MiB/s: 26.26
General statistics:
total time: 2.3795s
total number of events: 22800
Latency (ms):
min: 0.00
avg: 1.67
max: 227.86
95th percentile: 7.17
sum: 38004.17
Threads fairness:
events (avg/stddev): 1425.0000/236.61
execution time (avg/stddev): 2.3753/0.00
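The per-request latency figures in the two fileio reports are hard to compare directly, because sysbench 0.4.12 and 1.0.11 print different statistics (the event totals are 10011 and 22800). Below is a rough sketch of a like-for-like comparison that uses only the read and write rates copied from the output above. It assumes that the 0.4.12 "Requests/sec executed" line covers reads plus writes (6006 + 4005 operations over 2.3633 s works out to roughly that figure) and that its "Mb/sec" is the same unit as the MiB/s printed by 1.0.11. Both are assumptions, and the names are illustrative.

```python
# TX2 figures (sysbench 0.4.12 output above).
TX2_RW_REQUESTS_PER_S = 4236.11   # "Requests/sec executed"
TX2_THROUGHPUT_MB_S = 66.189      # "Total transferred ... (66.189Mb/sec)"

# Xavier figures (sysbench 1.0.11 output above).
XAVIER_READS_PER_S = 2520.58
XAVIER_WRITES_PER_S = 1680.39
XAVIER_READ_MIB_S = 39.38
XAVIER_WRITTEN_MIB_S = 26.26


def pct_diff(value, reference):
    """Percentage by which value differs from reference (negative means lower)."""
    return (value / reference - 1.0) * 100.0


xavier_rw_per_s = XAVIER_READS_PER_S + XAVIER_WRITES_PER_S
xavier_throughput = XAVIER_READ_MIB_S + XAVIER_WRITTEN_MIB_S

print(f"Read+write requests/s: TX2 {TX2_RW_REQUESTS_PER_S:.2f}, "
      f"Xavier {xavier_rw_per_s:.2f} "
      f"({pct_diff(xavier_rw_per_s, TX2_RW_REQUESTS_PER_S):+.2f}%)")
print(f"Throughput: TX2 {TX2_THROUGHPUT_MB_S:.2f}, "
      f"Xavier {xavier_throughput:.2f} "
      f"({pct_diff(xavier_throughput, TX2_THROUGHPUT_MB_S):+.2f}%)")
```

Under those assumptions the two boards land within about 1% of each other on this test, so the file I/O path is probably not where the gap comes from.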
Is there another tool you’d like me to use? I simply googled this one as the way to get the quickest result. What tool does NVIDIA use to do their comparisons?
Please update JetPack to version 4.1.1 (L4T 31.1). I have not flashed my Xavier with this option, but I know that 4.1.1 has new dtsi files and a dtb ending in maxn.
If this is going to help increase the performance then I should flash my Xavier with this option.