High CPU usage on Jetson TX2 with GigE fully loaded

Hello,

I am using two embedded boards connected to each other over a wired GigE link: a Jetson TX2 on an Orbitty Carrier and a FriendlyArm NanoPi Neo4.
Static IPv4 addresses are assigned on both sides of this link.

I’m quite unhappy with the CPU usage on the Jetson TX2 side while running a TCP/IP upload test program that uses all available bandwidth (code attached):

Nvidia Jetson TX2, Running MAXQ, jetson_clocks.sh executed

gyrolab@jetson-ai:~$ pidstat -u -p 24715 2
Linux 4.9.140-tegra (jetson-ai) 	22.05.20 	_aarch64_	(6 CPU)

19:38:08      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
19:38:10     1000     24715    0,50   29,50    0,00    2,00   30,00     5  tcpclient
19:38:12     1000     24715    0,50   29,50    0,00    2,50   30,00     3  tcpclient
19:38:14     1000     24715    0,50   28,36    0,00    1,00   28,86     5  tcpclient
19:38:16     1000     24715    0,50   29,50    0,00    1,50   30,00     5  tcpclient

Running the same code vice versa (client on the NanoPi’s side):

pi@nanopineo4:~$ pidstat -u -p 6944 5
Linux 5.6.11-rockchip64 (nanopineo4) 	05/22/20 	_aarch64_	(6 CPU)

16:46:01      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
16:46:06     1000      6944    0.20   14.20    0.00    0.00   14.40     1  tcpclient
16:46:11     1000      6944    0.20   14.20    0.00    0.00   14.40     1  tcpclient
16:46:16     1000      6944    0.00   14.40    0.00    0.00   14.40     1  tcpclient
16:46:21     1000      6944    0.20   14.00    0.00    0.00   14.20     1  tcpclient

It seems that CPU usage on the Jetson’s side is twice as high as on the FriendlyArm’s RK3399 for the same task.

Is this performance issue related to the old 4.9 kernel, is it a driver-side issue, or is it something else?
I would be glad for any advice related to optimizing the Jetson TX2’s wired network performance.

Sergiy

tcpserver.c

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>     /* bzero() */
#include <stdint.h>      /* uint32_t */
#include <sys/socket.h>
#include <arpa/inet.h>
#include <sys/time.h>

#define BUFSIZE (1024 * 1024)
#define PORT 8091 
#define SA struct sockaddr 
  
void func(int sockfd)
{
    char *buff = malloc(BUFSIZE);
    static struct timeval start_ts, ts;
    uint32_t tx_size = 1024 * 1024;
    uint32_t i = 0;
    uint32_t tm;

    if (buff == NULL) {
        fprintf(stderr, "malloc failed\n");
        return;
    }

    /* Nominal start handshake (zero-length read); the timer starts here. */
    fprintf(stderr, "Socket read started\n");
    read(sockfd, buff, 0);
    gettimeofday(&start_ts, NULL);
    fprintf(stderr, "Start NULL-packet received\n");

    for (;;) {
        gettimeofday(&ts, NULL);
        tm = ((ts.tv_sec - start_ts.tv_sec) * 1000000 +
                        (ts.tv_usec - start_ts.tv_usec)) / 1000;

        if (i % 100 == 0)
            fprintf(stderr, "%03u.%03u.Sending TX Buffer %u, size = %u\n",
                tm / 1000, tm % 1000, i, tx_size);

        /* Push 1 MiB per iteration; stop if the peer goes away. */
        if (write(sockfd, buff, tx_size) < 0) {
            perror("write");
            break;
        }
        i++;
    }

    free(buff);
}
  
int main() 
{ 
    int sockfd, connfd;
    socklen_t len;
    struct sockaddr_in servaddr, cli; 
  
    // socket create and verification 
    sockfd = socket(AF_INET, SOCK_STREAM, 0); 
    if (sockfd == -1) { 
        fprintf(stderr, "socket creation failed...\n"); 
        exit(0); 
    } 
    else
        fprintf(stderr, "Socket successfully created..\n"); 
    bzero(&servaddr, sizeof(servaddr)); 
  
    // assign IP, PORT 
    servaddr.sin_family = AF_INET; 
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY); 
    servaddr.sin_port = htons(PORT); 
  
    // Binding newly created socket to given IP and verification 
    if ((bind(sockfd, (SA*)&servaddr, sizeof(servaddr))) != 0) { 
        fprintf(stderr, "socket bind failed...\n"); 
        exit(0); 
    } 
    else
        fprintf(stderr, "Socket successfully bound..\n"); 
  
    // Now server is ready to listen and verification 
    if ((listen(sockfd, 5)) != 0) { 
        fprintf(stderr, "Listen failed...\n"); 
        exit(0); 
    } 
    else
        fprintf(stderr, "Server listening..\n"); 
    len = sizeof(cli); 
  
    // Accept the connection from the client and verify
    connfd = accept(sockfd, (SA*)&cli, &len); 
    if (connfd < 0) { 
        fprintf(stderr, "server accept failed...\n"); 
        exit(0); 
    } 
    else
        fprintf(stderr, "server accepted the client...\n"); 
  
    func(connfd); 
    close(sockfd); 
} 

tcpclient.c

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>     /* bzero() */
#include <stdint.h>      /* uint32_t */
#include <sys/socket.h>
#include <arpa/inet.h>
#include <sys/time.h>

#define SERVER_IP "10.10.75.2"
#define BUFSIZE (1024 * 1024)
#define PORT 8091 
#define SA struct sockaddr 

void func(int sockfd)
{
    char *buff = malloc(BUFSIZE);
    static struct timeval start_ts, ts;
    uint32_t i = 0;
    uint32_t tm;

    if (buff == NULL) {
        fprintf(stderr, "malloc failed\n");
        return;
    }
    memset(buff, 0xFA, BUFSIZE);

    /* Nominal start handshake (zero-length write); the timer starts here. */
    write(sockfd, buff, 0);
    fprintf(stderr, "Start NULL-packet sent\n");
    gettimeofday(&start_ts, NULL);

    for (;;) {
        gettimeofday(&ts, NULL);
        tm = ((ts.tv_sec - start_ts.tv_sec) * 1000000 +
                        (ts.tv_usec - start_ts.tv_usec)) / 1000;

        /* Accumulate 1 MiB per iteration; data is always read into the start
         * of buff since only throughput matters for this test. */
        uint32_t size = 0;
        while (size < BUFSIZE) {
            ssize_t n = read(sockfd, buff, BUFSIZE - size);
            if (n <= 0) {
                perror("read");
                free(buff);
                return;
            }
            size += n;
        }

        if (i % 100 == 0)
            fprintf(stderr, "%03u.%03u.Received RX Buffer %u, size = %u\n",
                tm / 1000, tm % 1000, i, size);
        i++;
    }
}
  
int main() 
{ 
    int sockfd; 
    struct sockaddr_in servaddr; 
  
    // socket create and verification 
    sockfd = socket(AF_INET, SOCK_STREAM, 0); 
    if (sockfd == -1) { 
        fprintf(stderr, "Socket creation failed...\n"); 
        exit(0); 
    } 
    else
        fprintf(stderr, "Socket successfully created..\n"); 
    bzero(&servaddr, sizeof(servaddr)); 
  
    // assign IP, PORT 
    servaddr.sin_family = AF_INET; 
    servaddr.sin_addr.s_addr = inet_addr(SERVER_IP); 
    servaddr.sin_port = htons(PORT); 
  
    // connect the client socket to server socket 
    if (connect(sockfd, (SA*)&servaddr, sizeof(servaddr)) != 0) { 
        fprintf(stderr, "Connection with the server failed...\n"); 
        exit(0); 
    } 
    else
        fprintf(stderr, "Connected to the server..\n"); 
  
    func(sockfd);  
    close(sockfd); 
}

After some more research I found that things are even worse than described above.

System load (top) under 540MBit/s TCP RX stream on Nvidia Jetson TX2:

Tasks: 317 total,   1 running, 316 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,7 us,  4,9 sy,  0,0 ni, 79,8 id,  0,0 wa,  1,9 hi, 12,7 si,  0,0 st
KiB Mem :  8049600 total,  4914532 free,  1873408 used,  1261660 buff/cache
KiB Swap:  4024784 total,  4024784 free,        0 used.  6365084 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21732 gyrolab   20   0    3192   2276   1180 S  18,5  0,0   0:02.72 tcpclient
    3 root      20   0       0      0      0 S   4,3  0,0  21:54.16 ksoftirqd/0
 6200 root      20   0 24,221g  66764  38732 S   2,0  0,8   8:03.83 Xorg
 6935 gyrolab   20   0 1195072 220572  70464 S   1,3  2,7   8:58.56 compiz
21738 root      20   0   10628   3896   3204 R   1,0  0,0   0:00.11 top

Compared with the “98.8 id” figure with no Ethernet load, that is 98.8 - 79.8 = a huge 19% CPU load just for 540 MBit/s.
Downloading at 980 MBit/s raises the system load to ~27% of the CPU.

I think 27% of the machine’s CPU resources is a very high price to pay for a 1 Gbit/s TCP download.

The 12.7% softirq load looks very high for ~11k softirqs/sec (checked by reading /proc/softirqs periodically).
Playing at the software level with setsockopt SO_RCVBUF and various buffer sizes does not improve anything, nor does raising sysctl net.core.rmem_max / wmem_max up to 32 MBytes.
Playing with the coalesce parameters for eth0 to reduce the softirq rate is not possible because they are not supported by the Ethernet driver.
Enabling jumbo frames with ifconfig eth0 mtu 9000 reduces the link speed by 3-4 times and increases the CPU load.
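
For reference, the SO_RCVBUF experiment mentioned above was along these lines (just a sketch of what I tried; the helper name and the exact sizes are arbitrary, and note the kernel doubles the requested value and caps it at net.core.rmem_max):

/* Sketch: vary the receive buffer on the reading socket (tcpclient.c). */
#include <stdio.h>
#include <sys/socket.h>

static void set_rcvbuf(int sockfd, int bytes)
{
    socklen_t len = sizeof(bytes);

    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) != 0)
        perror("setsockopt(SO_RCVBUF)");

    /* Read back what the kernel actually applied (doubled, capped by rmem_max). */
    if (getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &bytes, &len) == 0)
        fprintf(stderr, "Effective SO_RCVBUF: %d bytes\n", bytes);
}

I called it right after connect(); as noted above, none of the sizes I tried changed the load noticeably.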

top output from the FriendlyArm while running the TCP stream vice versa (client on the FriendlyArm side):

Tasks: 138 total,   1 running, 137 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.6 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    982.6 total,    749.5 free,    118.5 used,    114.6 buff/cache
MiB Swap:    491.3 total,    491.3 free,      0.0 used.    782.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   6314 pi        20   0    3144   2388   1316 S  11.3   0.2   0:08.53 tcpclient
   6317 pi        20   0   10528   3140   2616 R   1.0   0.3   0:00.04 top
     25 root      20   0       0      0      0 S   0.3   0.0   0:12.71 ksoftirqd/3
      1 root      20   0  166628   9732   7180 S   0.0   1.0   0:04.12 systemd

99.9 - 98.3 = just 1.6% CPU load. Link load is 540 MBit/s (checked with bmon --use-bit eth0).

Summary:
Why does the same Ethernet link load create a 19% CPU load on the Jetson TX2 / L4T / kernel 4.9 and only a 1.6% CPU load on the FriendlyArm / Armbian / kernel 5.6?
Is something wrong with the L4T Ethernet eqos driver, or is it related to the kernel or kernel configuration?
Is it possible to reduce the Jetson’s CPU load without reducing the eth0 link throughput on the Jetson TX2?

PS Sorry for the long posts.

Hi

I think we can use iperf to identify whether the performance issue is caused by the code or by the platform.
Could you try it?

Thank you for your participation.

10.10.75.1 - Jetson TX2 / kernel 4.9.140-tegra
10.10.75.2 - FriendlyArm NanoPi Neo4 / kernel 5.6.11-rockchip64

Both SBCs are connected to each other with a straight patch cord.

CPU usage on Jetson before iperf3 test:
%Cpu(s): 0,2 us, 0,8 sy, 0,0 ni, 98,6 id, 0,0 wa, 0,1 hi, 0,1 si, 0,0 st

Download test:

gyrolab@jetson-ai:~$ iperf3 -R -c 10.10.75.2 -t 1000
Connecting to host 10.10.75.2, port 5201
Reverse mode, remote host 10.10.75.2 is sending
[  4] local 10.10.75.1 port 59334 connected to 10.10.75.2 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   112 MBytes   942 Mbits/sec                  
[  4]   1.00-2.00   sec   112 MBytes   941 Mbits/sec                  
[  4]   2.00-3.00   sec   112 MBytes   942 Mbits/sec                  
[  4]   3.00-4.00   sec   112 MBytes   942 Mbits/sec

top - 16:52:58 up 36 min,  1 user,  load average: 1,23, 0,91, 0,90
Tasks: 310 total,   3 running, 307 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,8 us,  9,2 sy,  0,0 ni, 65,6 id,  0,0 wa,  6,0 hi, 18,5 si,  0,0 st
KiB Mem :  8049600 total,  5536844 free,  1517688 used,   995068 buff/cache
KiB Swap:  4024784 total,  4024784 free,        0 used.  6496024 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    3 root      20   0       0      0      0 R  70,4  0,0   8:07.63 ksoftirqd/0
12634 gyrolab   20   0    3208    724    608 R  37,1  0,0   0:10.16 iperf3
 6332 root      20   0 24,164g  42308  24432 S   1,6  0,5   1:39.10 Xorg
 7374 gyrolab   20   0 1452492 217144  71736 S   0,8  2,7   1:53.08 compiz
 7719 gyrolab   20   0  528400  38712  27260 S   0,6  0,5   0:42.20 gnome-terminal-
  981 root     -51   0       0      0      0 S   0,4  0,0   0:15.26 irq/53-host_syn
 1405 root     -51   0       0      0      0 S   0,4  0,0   0:09.51 irq/60-15210000

Upload test:

gyrolab@jetson-ai:~$ iperf3 -c 10.10.75.2 -t 1000
Connecting to host 10.10.75.2, port 5201
[  4] local 10.10.75.1 port 59596 connected to 10.10.75.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   109 MBytes   912 Mbits/sec    0    455 KBytes       
[  4]   1.00-2.00   sec   104 MBytes   875 Mbits/sec    0    477 KBytes       
[  4]   2.00-3.00   sec   105 MBytes   878 Mbits/sec    0    477 KBytes       
[  4]   3.00-4.00   sec   101 MBytes   847 Mbits/sec    0    477 KBytes       
[  4]   4.00-5.00   sec   101 MBytes   851 Mbits/sec    0    477 KBytes

top - 17:08:26 up 52 min,  1 user,  load average: 0,69, 0,81, 0,97
Tasks: 311 total,   1 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,7 us,  3,6 sy,  0,0 ni, 80,6 id,  0,0 wa,  2,0 hi, 13,1 si,  0,0 st
KiB Mem :  8049600 total,  5511904 free,  1524572 used,  1013124 buff/cache
KiB Swap:  4024784 total,  4024784 free,        0 used.  6476464 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13955 gyrolab   20   0    3208    716    600 S   8,3  0,0   0:01.89 iperf3
    3 root      20   0       0      0      0 S   7,3  0,0  11:45.24 ksoftirqd/0
 6332 root      20   0 24,164g  42308  24432 S   2,2  0,5   2:13.23 Xorg
 5381 root     -51   0       0      0      0 S   2,0  0,0   0:18.26 sugov:0
 7374 gyrolab   20   0 1452492 217676  71736 S   1,0  2,7   2:34.82 compiz
 1405 root     -51   0       0      0      0 S   0,6  0,0   0:13.45 irq/60-15210000
 5924 root      20   0       0      0      0 S   0,6  0,0   0:19.28 nvgpu_channel_p

Just to compare, CPU usage on 10.10.75.2 (FriendlyArm; the Jetson is the TX side):

top - 14:15:11 up  6:23,  3 users,  load average: 0.05, 0.02, 0.00
Tasks: 140 total,   2 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  3.1 sy,  0.0 ni, 96.7 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    982.6 total,    548.9 free,    123.6 used,    310.1 buff/cache
MiB Swap:    491.3 total,    491.3 free,      0.0 used.    772.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10784 pi        20   0    5232   2428   2116 R  18.5   0.2   0:30.78 iperf3
  10786 root      20   0       0      0      0 I   0.2   0.0   0:00.08 kworker/1:2-events
  10875 pi        20   0   10528   3152   2620 R   0.2   0.3   0:00.06 top
      1 root      20   0  166628   9740   7180 S   0.0   1.0   0:05.38 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.03 kthreadd

I am curious, if you run the test with htop running, does the excess load stay on CPU0? Or is it on a different core? I use htop since it shows a bar chart of individual core loads…I am curious about how this distributes and not just the total load (“sudo apt-get install htop”).

Could you also run tegrastats during your test? Then we could see the actual CPU loading.

Also, is there any other device that can do the test instead of the FriendlyArm NanoPi? For example, an Ubuntu host, running iperf between them.

1 - Switched to runlevel 3 to shut down the graphics environment:

gyrolab@jetson-ai:~$ sudo init 3

2 - Run iperf3 test

gyrolab@jetson-ai:~$ iperf3 -c 10.10.75.2 -t 1000 -R

Connecting to host 10.10.75.2, port 5201
Reverse mode, remote host 10.10.75.2 is sending
[  4] local 10.10.75.1 port 54228 connected to 10.10.75.2 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   110 MBytes   923 Mbits/sec                  
[  4]   1.00-2.00   sec   112 MBytes   942 Mbits/sec                  
[  4]   2.00-3.00   sec   112 MBytes   941 Mbits/sec 

3 - Tegrastats

gyrolab@jetson-ai:~$ cat stat_tegrastats 

RAM 562/7861MB (lfb 1611x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1995,off,off,4%@1997,20%@1998,14%@1997] EMC_FREQ 1%@1600 GR3D_FREQ 0%@216 APE 150 PLL@42C MCPU@42C PMIC@100C Tboard@38C GPU@40C BCPU@42C thermal@40.9C Tdiode@39.75C VDD_SYS_GPU 144/144 VDD_SYS_SOC 869/869 VDD_4V0_WIFI 19/19 VDD_IN 4375/4375 VDD_SYS_CPU 965/965 VDD_SYS_DDR 844/844
RAM 562/7861MB (lfb 1611x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2001,off,off,3%@1996,20%@2003,11%@2000] EMC_FREQ 1%@1600 GR3D_FREQ 0%@216 APE 150 PLL@41.5C MCPU@41.5C PMIC@100C Tboard@38C GPU@40C BCPU@41.5C thermal@41.2C Tdiode@39.75C VDD_SYS_GPU 144/144 VDD_SYS_SOC 869/869 VDD_4V0_WIFI 19/19 VDD_IN 4375/4375 VDD_SYS_CPU 966/965 VDD_SYS_DDR 844/844

4 - CPU and IRQ stats (20 seconds period):

gyrolab@jetson-ai:~$ mpstat -P ALL -n 20

Linux 4.9.140-tegra (jetson-ai) 	25.05.20 	_aarch64_	(6 CPU)

14:30:08     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:30:28     all    0,44    0,00    8,08    0,01    5,79   17,72    0,00    0,00    0,00   67,95
14:30:28       0    0,00    0,00    0,16    0,00   23,69   72,87    0,00    0,00    0,00    3,29
14:30:28       1    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00
14:30:28       2    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00
14:30:28       3    0,51    0,00   12,16    0,00    0,05    0,10    0,00    0,00    0,00   87,17
14:30:28       4    0,05    0,00    1,76    0,00    0,05    0,05    0,00    0,00    0,00   98,08
14:30:28       5    1,24    0,00   18,12    0,00    0,05    0,00    0,00    0,00    0,00   80,60


Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0,44    0,00    8,08    0,01    5,79   17,72    0,00    0,00    0,00   67,95
Average:       0    0,00    0,00    0,16    0,00   23,69   72,87    0,00    0,00    0,00    3,29
Average:       1    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00
Average:       2    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00    0,00
Average:       3    0,51    0,00   12,16    0,00    0,05    0,10    0,00    0,00    0,00   87,17
Average:       4    0,05    0,00    1,76    0,00    0,05    0,05    0,00    0,00    0,00   98,08
Average:       5    1,24    0,00   18,12    0,00    0,05    0,00    0,00    0,00    0,00   80,60

5 - Top (skipped first iteration)

gyrolab@jetson-ai:~$ top -bn2 > stat_top

top - 14:32:49 up 21 min,  5 users,  load average: 1,40, 1,13, 0,68
Tasks: 236 total,   2 running, 234 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,5 us,  9,0 sy,  0,0 ni, 66,6 id,  0,0 wa,  5,9 hi, 18,0 si,  0,0 st
KiB Mem :  8049600 total,  6950164 free,   521440 used,   577996 buff/cache
KiB Swap:  4024784 total,  4024784 free,        0 used.  7547816 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    3 root      20   0       0      0      0 R  70,0  0,0   6:01.25 ksoftirqd/0
 9457 gyrolab   20   0    2856   1676   1408 S  37,3  0,0   3:12.75 iperf3
 5329 root     -51   0       0      0      0 S   0,7  0,0   0:01.04 sugov:0
 9849 gyrolab   20   0   10496   3800   3144 R   0,7  0,0   0:00.05 top
 5116 root      20   0       0      0      0 S   0,3  0,0   0:01.54 dhd_watchdog_th
 5123 root     -51   0       0      0      0 S   0,3  0,0   0:02.28 dhd_dpc
 5842 root      20   0       0      0      0 S   0,3  0,0   0:02.18 nvgpu_channel_p

It looks like 4 of the 6 cores are running at 2 GHz, and a ~1 Gb TCP stream keeps one core fully loaded and another partially loaded, mostly from softirq processing in the ksoftirqd/0 thread.

I’ll set up an amd64 machine and post results a bit later to compare.

Note that you will find the hardware IRQs running on CPU0, but ksoftirqd can go to other cores. I am thinking you are bottlenecking CPU0 with hardware IRQs, but the other cores are not bottlenecked. The implication is that you will still have plenty of CPU power for anything not requiring CPU0. I am wondering what your use case is, and whether there is a problem of insufficient CPU resources, or whether this was just an observation about CPU usage.
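
A quick way to confirm which core the hardware IRQ lands on is to watch /proc/interrupts, e.g. "grep eth /proc/interrupts" a few times; here is the same thing as a trivial C sketch (the "eth"/"eqos" match strings are just a guess at how the TX2's Ethernet IRQ line is labeled):

/* Sketch: print the /proc/interrupts rows for the Ethernet controller so you
 * can see which CPU column the interrupt counters are growing in. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/interrupts", "r");

    if (f == NULL) {
        perror("/proc/interrupts");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        if (strstr(line, "eth") || strstr(line, "eqos"))
            fputs(line, stdout);
    fclose(f);
    return 0;
}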

Note that soft IRQs, despite being able to distribute to any core, will often be scheduled on the same core to take advantage of cache hits. There may be times when you have a known, purely software process/thread where intentionally assigning a core affinity could help.
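
For a purely software process or thread, a minimal sketch of setting that affinity from inside the program (core 3 is only an example; "taskset -c 3 ./tcpclient" does the same from the shell):

/* Sketch: pin the calling process to one non-CPU0 core so it stays away from
 * the core handling the Ethernet hardware IRQ. The core number is an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static void pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* pid 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}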

At this stage there is no bottlenecking, just an observation. I am at the design stage with my application. A raw ~850 Mbit/s video stream receiver is the first step; the next step should be a video processing application.

I agree it is possible to distribute the network processing load between CPU cores. The question was whether it is possible to reduce the CPU load itself.

I was wrong about jumbo frames; of course the MTU has to be set on both sides. Setting the MTU to 4000-5000 reduces the CPU load by about one third in my particular case. Raising it further to 7000-9000 reduces the link speed with no significant CPU load improvement.
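
(For completeness, the MTU change itself can also be done from code instead of ifconfig; a minimal sketch using the standard SIOCSIFMTU ioctl, assuming eth0, root privileges, and the same value on both ends:)

/* Sketch: set the interface MTU programmatically (CAP_NET_ADMIN required),
 * equivalent to "ifconfig eth0 mtu 4000". Both link partners must match. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

static int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;

    if (ioctl(fd, SIOCSIFMTU, &ifr) != 0)
        perror("ioctl(SIOCSIFMTU)");

    close(fd);
    return 0;
}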

I also checked a Celeron N3050 / amd64 machine with Ubuntu 20.04 / kernels 5.4, 5.6.0, and 5.6.0 lowlatency. Generally it performs the same way as the Jetson.

It seems the question is not about the Jetson TX2/L4T but about the RK3399/Armbian. The question is why it performs with such ridiculously low CPU usage for networking, especially for IRQ handling, and whether it is possible to optimize the TX2/L4T in the same way.

I can’t answer, but I think whatever limitations you get will be due to CPU0. If anything in your use-case does not require a hardware IRQ, and can either be put in ksoftirqd management, or else in user space, then you will maximize throughput. Some RT extensions might allow setting cgroups such that you can prioritize. If any non-hardware-IRQ (anything not needing to talk to the physical address of the chain of hardware in your video pipeline) can be given core affinity for one or more of the non-CPU0 cores you’ll also probably get better results. I have no particular recommendations, it is all an experiment for your particular use-case.
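
One generic, not Jetson-specific, knob in that direction is Receive Packet Steering (RPS): writing a CPU bitmask to the per-queue rps_cpus file lets the receive protocol processing (the softirq part) run on cores other than the one taking the hardware IRQ. Whether it lowers total load on the TX2, or just spreads it, is something to measure. A minimal sketch, assuming eth0 with a single rx queue and a mask covering cores 3-5 (hex 38):

/* Sketch: enable RPS for eth0's first rx queue so receive softirq work can be
 * steered to cores 3-5 (hex mask "38"), leaving CPU0 with the hardware IRQ.
 * Path and mask are examples; needs root. */
#include <stdio.h>

static int set_rps_mask(const char *mask)
{
    const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
    FILE *f = fopen(path, "w");

    if (f == NULL) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", mask);
    fclose(f);
    return 0;
}

set_rps_mask("38") is equivalent to: echo 38 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus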

You may also try to increase the kernel's maximum receive buffer size on the Jetson for receiving:

#This might be huge, but let's try it
sudo sysctl -w net.core.rmem_max=25000000

At this point I would like to conclude that a lot of knowledge is required to optimize the network stack and to measure CPU load correctly.

This should be my handbook: https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/

I tried to play with every option explained in the article above. Many of them are not supported by the hardware and some have no effect, but the article has a great explanation of network stack tuning.

Finally I was able to get the displayed CPU load (mpstat) down to 77-83% idle at full CPU / max clock during a 957 MBit/s receive with jumbo frames enabled (MTU 9000 on both sides), which makes me happy.

After a lot of tests I think there is nothing wrong with the Jetson / L4T network stack itself, only some HW/SW/knowledge limitations, which is pretty understandable.

Anyway, thanks to everyone who participated; I'm happy to work with such a great community!
