CPU performance is worse on the Xavier than the TX2

Hi!

I’ve got a multi-threaded application that I’ve profiled. My GPU kernels are faster on the Xavier than on the TX2, but overall the application runs about 30-50 ms slower on the Xavier. Breaking the work down by task, I can see that the CPU-heavy tasks are often the slower ones. For example, I have to poll a driver to see when a task on an external FPGA has completed. The FPGA takes the same amount of time regardless of which NVIDIA board I am using, so I surmise that the slowdown must come from how Linux schedules the polling wake-up. A second example is PCIe transfer time: transfers are also slower on the Xavier than on the TX2, even though we went from x4 to x8 lanes. The driver uses a copy_from_user call, which is strictly CPU-bound. Since the PCIe transfer itself should be physically faster, the slowdown must be on the CPU side. Lastly, I have some large data-manipulation loops that run on the CPU, and these are slower too, albeit by only a few milliseconds. The worst offender in this case does a lot of memory allocation.
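
The wake-up hypothesis can be checked without the FPGA driver by timing how far a short sleep overshoots its requested duration on each board. What follows is only a minimal sketch, assuming Python 3.5 or later is installed on both the TX2 and the Xavier; the function names, the 1 ms interval, and the sample count are illustrative choices and do not come from the application described above.

```python
import statistics
import time


def measure_wakeup_overshoot(interval_s=0.001, samples=200):
    """Sleep `samples` times and return each overshoot in microseconds."""
    overshoots_us = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)  # ask the kernel to wake us after interval_s
        elapsed_s = time.perf_counter() - start
        overshoots_us.append((elapsed_s - interval_s) * 1e6)
    return overshoots_us


def summarize(overshoots_us):
    """Return min, median, approximate 95th percentile, and max."""
    ordered = sorted(overshoots_us)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "min_us": ordered[0],
        "median_us": statistics.median(ordered),
        "p95_us": ordered[p95_index],
        "max_us": ordered[-1],
    }


if __name__ == "__main__":
    print(summarize(measure_wakeup_overshoot()))
```

Running the same script on both boards under the same load gives a rough comparison. A clearly larger median or 95th percentile on the Xavier would support the scheduling explanation, while similar numbers would point elsewhere.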

I have everything at max power and frequency. My application only uses 50% of the RAM available.

ubuntu@tegra-ubuntu:~$ sudo nvpmodel -q
NV Power Mode: MAXN
0
ubuntu@tegra-ubuntu:~$ sudo ./jetson_clocks.sh --show
[sudo] password for ubuntu: 
SOC family:tegra194  Machine:jetson-xavier
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu1: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu2: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu3: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu4: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu5: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu6: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu7: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
ubuntu@tegra-ubuntu:~$

Can someone comment on why this is? Is there a governor that I didn’t turn off, or has NVIDIA noticed this behavior too?
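
The governor and frequency limits can also be read straight from the standard Linux cpufreq sysfs files, as a cross-check on the jetson_clocks.sh output. The sketch below assumes the usual /sys/devices/system/cpu/cpuN/cpufreq layout; the helper names read_cpufreq and is_pinned are made up for this illustration.

```python
from pathlib import Path


def read_cpufreq(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {file_name: value}} for each CPU exposing cpufreq."""
    wanted = ("scaling_governor", "scaling_min_freq",
              "scaling_max_freq", "scaling_cur_freq")
    report = {}
    for cpufreq_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        values = {}
        for name in wanted:
            try:
                values[name] = (cpufreq_dir / name).read_text().strip()
            except OSError:
                values[name] = None  # file missing or unreadable
        report[cpufreq_dir.parent.name] = values
    return report


def is_pinned(values):
    """True when min, max, and current frequency are all the same value."""
    freqs = {values[key] for key in
             ("scaling_min_freq", "scaling_max_freq", "scaling_cur_freq")}
    return None not in freqs and len(freqs) == 1


if __name__ == "__main__":
    for cpu, values in read_cpufreq().items():
        print(cpu, values["scaling_governor"], "pinned:", is_pinned(values))
```

If every core reports the same minimum, maximum, and current frequency, the governor has no room to scale, so the governor name alone (schedutil here) would not explain a slowdown.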

Thanks for your advice,
Brandy

Hi BrandyJ,

Could you check whether a benchmark tool shows the same CPU performance difference between the TX2 and the Xavier?

Hello vickyy,

Sure. I installed sysbench and ran it.

Here is the TX2:

root@tegra-ubuntu:/opt/logostech# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:04:4b:a7:f7:91  
          inet addr:172.17.1.138  Bcast:172.17.1.255  Mask:255.255.255.0
          inet6 addr: fe80::abf0:6186:57ca:2c82/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:24167643 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15600 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2973211419 (2.9 GB)  TX bytes:3273322 (3.2 MB)
          Interrupt:42 
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/tegrastats 
RAM 1626/6848MB (lfb 984x4MB) CPU [0%@2034,0%@2035,0%@2034,0%@2035,0%@2034,0%@2034] EMC_FREQ 0%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 1% bg 13% BCPU@31C MCPU@31C GPU@29.5C PLL@31C Tboard@26C Tdiode@27.5C PMIC@100C thermal@30.4C VDD_IN 3548/3548 VDD_CPU 381/381 VDD_GPU 152/152 VDD_SOC 762/762 VDD_WIFI 0/0 VDD_DDR 1080/1080
RAM 1627/6848MB (lfb 984x4MB) CPU [0%@2035,0%@2034,0%@2035,0%@2036,0%@2035,0%@2034] EMC_FREQ 0%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@31C MCPU@31C GPU@29.5C PLL@31C Tboard@26C Tdiode@27.5C PMIC@100C thermal@30.4C VDD_IN 3472/3510 VDD_CPU 381/381 VDD_GPU 152/152 VDD_SOC 762/762 VDD_WIFI 0/0 VDD_DDR 1080/1080
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/jetson_clocks.sh --show
SOC family:tegra186  Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu2: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu3: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=1300500000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
Fan: speed=255
root@tegra-ubuntu:/opt/logostech# cat /proc/cpuinfo 
processor	: 0
model name	: ARMv8 Processor rev 3 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x1
CPU part	: 0xd07
CPU revision	: 3

processor	: 1
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x003
CPU revision	: 0
MTS version	: 40418221

processor	: 2
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x003
CPU revision	: 0
MTS version	: 40418221

processor	: 3
model name	: ARMv8 Processor rev 3 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x1
CPU part	: 0xd07
CPU revision	: 3

processor	: 4
model name	: ARMv8 Processor rev 3 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x1
CPU part	: 0xd07
CPU revision	: 3

processor	: 5
model name	: ARMv8 Processor rev 3 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x1
CPU part	: 0xd07
CPU revision	: 3

root@tegra-ubuntu:/opt/logostech# sysbench --test=cpu run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 10000

Test execution summary:
    total time:                          5.0561s
    total number of events:              10000
    total time taken by event execution: 5.0538
    per-request statistics:
         min:                                  0.50ms
         avg:                                  0.51ms
         max:                                  0.68ms
         approx.  95 percentile:               0.52ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   5.0538/0.00

root@tegra-ubuntu:/opt/logostech#

And here is the Xavier:

root@tegra-ubuntu:/opt/logostech# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.1.171  netmask 255.255.255.0  broadcast 172.17.1.255
        inet6 fe80::849e:40b2:212f:3c89  prefixlen 64  scopeid 0x20<link>
        ether 00:04:4b:cb:90:07  txqueuelen 1000  (Ethernet)
        RX packets 54373888  bytes 6390622233 (6.3 GB)
        RX errors 0  dropped 14  overruns 0  frame 0
        TX packets 162813  bytes 19834536 (19.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 40  
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/tegrastats 
RAM 3974/15039MB (lfb 2190x4MB) CPU [0%@2255,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2224,0%@2265] EMC_FREQ 0%@2133 GR3D_FREQ 0%@1377 APE 150 MTS fg 0% bg 0% AO@26.5C GPU@28C Tboard@27C Tdiode@28.75C AUX@28C CPU@28.5C thermal@28.15C PMIC@100C GPU 1232/1232 CPU 462/462 SOC 2464/2464 CV 0/0 VDDRQ 0/0 SYS5V 3416/3416
RAM 3974/15039MB (lfb 2190x4MB) CPU [0%@2235,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265] EMC_FREQ 0%@2133 GR3D_FREQ 0%@1377 APE 150 MTS fg 0% bg 0% AO@26C GPU@28C Tboard@27C Tdiode@29C AUX@28C CPU@28.5C thermal@28.15C PMIC@100C GPU 1232/1232 CPU 462/462 SOC 2464/2464 CV 0/0 VDDRQ 0/0 SYS5V 3416/3416
root@tegra-ubuntu:/opt/logostech# /home/ubuntu/jetson_clocks.sh --show
SOC family:tegra194  Machine:jetson-xavier
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu1: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu2: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu3: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu4: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu5: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu6: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu7: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
root@tegra-ubuntu:/opt/logostech# nvpmodel -q
NV Power Mode: MAXN
0
root@tegra-ubuntu:/opt/logostech# cat /proc/cpuinfo 
processor	: 0
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 1
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 2
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 3
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 4
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 5
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 6
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549

processor	: 7
model name	: ARMv8 Processor rev 0 (v8l)
BogoMIPS	: 62.50
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer	: 0x4e
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0x004
CPU revision	: 0
MTS version	: 43226549
root@tegra-ubuntu:/opt/logostech# sysbench --test=cpu --events=10000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Prime numbers limit: 10000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:  1929.09

General statistics:
    total time:                          5.1806s
    total number of events:              10000

Latency (ms):
         min:                                  0.50
         avg:                                  0.52
         max:                                  1.63
         95th percentile:                      0.54
         sum:                               5172.01

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   5.1720/0.00

root@tegra-ubuntu:/opt/logostech#

EDIT: I realized that the Ubuntu 18 version of the tool was not stopping at 10000 events. I added that limit, and now we can compare apples to apples. The Xavier is still a bit slower, but now by about 100 ms, which is more in line with what I see in my application.
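
To put the two sysbench CPU runs side by side, the short calculation below uses only the totals and CPU frequencies printed above. It is a rough sketch rather than a rigorous comparison: the two boards run different sysbench versions (0.4.12 and 1.0.11), the logs do not show which TX2 core type ran the single thread, and the "events per gigacycle" figure is an illustrative normalization introduced here, not a sysbench metric.

```python
# Totals copied from the sysbench output above; frequencies from jetson_clocks.sh.
TX2 = {"total_s": 5.0561, "events": 10000, "freq_khz": 2035200}
XAVIER = {"total_s": 5.1806, "events": 10000, "freq_khz": 2265600}


def events_per_second(run):
    """Events completed per second of wall-clock time."""
    return run["events"] / run["total_s"]


def events_per_gigacycle(run):
    """Events per 1e9 CPU cycles, assuming the clock stayed at freq_khz."""
    gigacycles = run["total_s"] * run["freq_khz"] * 1e3 / 1e9
    return run["events"] / gigacycles


def slowdown_ms(baseline, other):
    """How much longer `other` took than `baseline`, in milliseconds."""
    return (other["total_s"] - baseline["total_s"]) * 1e3


if __name__ == "__main__":
    print("slowdown (ms): {:.1f}".format(slowdown_ms(TX2, XAVIER)))
    for name, run in (("TX2", TX2), ("Xavier", XAVIER)):
        print("{}: {:.1f} events/s, {:.1f} events/Gcycle".format(
            name, events_per_second(run), events_per_gigacycle(run)))
```

Because the Xavier cores are clocked higher than the TX2 cores, the per-cycle gap on this prime-number test is larger than the wall-clock gap suggests.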

Thanks for your input!

So I ran the fileio test on both machines also.

TX2:

root@tegra-ubuntu:/opt/logostech# sysbench --num-threads=16 --test=fileio --file-total-size=3G --file-test-mode=rndrw --max-requests=10000 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 16

Extra file open flags: 0
128 files, 24Mb each
3Gb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Done.

Operations performed:  6006 Read, 4005 Write, 12806 Other = 22817 Total
Read 93.844Mb  Written 62.578Mb  Total transferred 156.42Mb  (66.189Mb/sec)
 4236.11 Requests/sec executed

Test execution summary:
    total time:                          2.3633s
    total number of events:              10011
    total time taken by event execution: 0.2190
    per-request statistics:
         min:                                  0.01ms
         avg:                                  0.02ms
         max:                                  4.25ms
         approx.  95 percentile:               0.02ms

Threads fairness:
    events (avg/stddev):           625.6875/113.04
    execution time (avg/stddev):   0.0137/0.00

Xavier:

root@tegra-ubuntu:/opt/logostech# sysbench --threads=16 fileio --file-total-size=3G --file-test-mode=rndrw --events=10000 run
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Initializing random number generator from current time

Extra file open flags: 0
128 files, 24MiB each
3GiB total file size
Block size 16KiB
Number of IO requests: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!

File operations:
    reads/s:                      2520.58
    writes/s:                     1680.39
    fsyncs/s:                     5377.23

Throughput:
    read, MiB/s:                  39.38
    written, MiB/s:               26.26

General statistics:
    total time:                          2.3795s
    total number of events:              22800

Latency (ms):
         min:                                  0.00
         avg:                                  1.67
         max:                                227.86
         95th percentile:                      7.17
         sum:                              38004.17

Threads fairness:
    events (avg/stddev):           1425.0000/236.61
    execution time (avg/stddev):   2.3753/0.00
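
The two fileio reports are not directly comparable on event count or latency. The TX2 run (sysbench 0.4.12) reports 10011 events, which matches its reads plus writes, while the Xavier run (sysbench 1.0.11) reports 22800 events, which is close to reads, writes, and fsyncs combined. The sketch below uses only numbers from the output above to compare the read/write rate on its own. The conclusion that the newer version counts fsyncs as events is inferred from these figures and has not been checked against the sysbench source.

```python
# Figures copied from the two sysbench fileio reports above.
TX2 = {"reads": 6006, "writes": 4005, "events": 10011, "total_s": 2.3633}
XAVIER = {"reads_per_s": 2520.58, "writes_per_s": 1680.39,
          "fsyncs_per_s": 5377.23, "events": 22800, "total_s": 2.3795}


def tx2_read_write_rate(run):
    """Reads plus writes per second, derived from the operation counts."""
    return (run["reads"] + run["writes"]) / run["total_s"]


def xavier_read_write_rate(run):
    """Reads plus writes per second, as reported by the newer sysbench."""
    return run["reads_per_s"] + run["writes_per_s"]


def xavier_implied_operations(run):
    """Total operations implied by the three per-second rates."""
    rate = run["reads_per_s"] + run["writes_per_s"] + run["fsyncs_per_s"]
    return rate * run["total_s"]


if __name__ == "__main__":
    print("TX2 reads+writes per second:    {:.1f}".format(tx2_read_write_rate(TX2)))
    print("Xavier reads+writes per second: {:.1f}".format(xavier_read_write_rate(XAVIER)))
    print("Xavier implied operations:      {:.0f} (reported events: {})".format(
        xavier_implied_operations(XAVIER), XAVIER["events"]))
```

On this reading, the read/write rates of the two boards are within about one percent of each other, and the much higher Xavier latency figures probably reflect fsync calls being included in its statistics.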

Is there another tool you’d like me to use? I simply googled this one as the way to get the quickest result. What tool does NVIDIA use to do their comparisons?

Thanks,
Brandy

Hi BrandyJ,

Please flash Xavier with below cfg:

sudo ./flash.sh jetson-xavier-maxn mmcblk0p1

Set max performance and test again:

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

Thanks!

Hi Carolyuu,

Thanks!

I tried this; however, I guess I need a newer JetPack. We are using JetPack 4.1 (L4T 31.0.2), and it gives me this error:

Error: Invalid target board - jetson-xavier-maxn

What version will work with jetson-xavier-maxn?

Thanks,
Brandy

Hi BrandyJ,

Please update JetPack to version 4.1.1 (L4T 31.1). I have not flashed my Xavier with this option, but I know that 4.1.1 has new dtsi and dtb files ending in maxn.

If this is going to help increase the performance then I should flash my Xavier with this option.

Hi BrandyJ,

Please use JetPack-4.1.1 (r31.1). Thanks!

Hi,

Thanks. I am working on updating my JetPack.