Port ethernet performance

Hi,

we are currently evaluating TCP/IP performance on the TX2. For that we use a second x64 system running Ubuntu 16.04 to send a constant stream of data to the TX2 using the socket API.

What we see is that performance is quite good for a period of time and then collapses. Sometimes the throughput recovers; sometimes it stays erratic or outright bad.

When running top at the same time, one can see that the software interrupt load on CPU0 is constantly at or near 100% (si column). Looking at /proc/softirqs shows that it is the NET_RX softirq which accounts for this. So the current working hypothesis is that the system is operating at its upper limit when receiving, and once another process eats some CPU time the throughput collapses. Can this be true? And is there a way to optimize/stabilize ethernet performance?

Thanks!

CPU0 is the only core which can service hardware interrupts. Too many interrupts arriving at once results in "IRQ starvation".

You would want to make sure the system isn't simply losing performance by dropping into some low power-save mode. If not already done, try "sudo nvpmodel -m0" and "sudo ~ubuntu/jetson_clocks.sh" before your experiment starts. If you are using anything with USB, perhaps add this to the kernel command line in extlinux.conf:

usbcore.autosuspend=-1

For performance reasons a typical driver might split its function between hardware-only and software components: the first part of an interrupt services the hardware in the minimum possible time, and then raises a software IRQ to finish processing whatever the hardware IRQ produced (e.g., ksoftirqd sees software IRQs, but not hardware IRQs). Software IRQs can migrate to any core; hardware IRQs can only be serviced by cores wired to the device.

If a driver does more in the hardware interrupt than just servicing the hardware half before handing off to a software interrupt, then it may be using more time than it needs on CPU0, which might imply a need to redesign. If there are simply too many hardware interrupts being issued, then you'd have to cut out some of the drivers which are competing. It's hard to say what is competing without profiling.

robert4f5p3,

Looks like an issue. Do you have any statistics that show performance is indeed dropping?

We are currently doing some analysis and collecting more data. We will let you know if there are any interesting findings.

What I know so far is that the driver for the chipset (BCM54610) saw some modifications. Just compare
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/drivers/net/phy/broadcom.c?h=v4.16-rc5
with
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/drivers/net/phy/broadcom.c?h=v4.4

Hi,

we did some more investigation. In general we see good RX performance on the TX2. Unfortunately the throughput drops every ~5 seconds.

Here (https://pastebin.com/5Xa70QVK) is sample code to illustrate the issue. It should run out of the box on Linux or Windows and compiles with gcc and the Visual Studio compiler. The executable can run in client or server mode. In a loop, a message is sent from the client to the server. To start the server on the TX2, execute the program without arguments. To start the client on a second Linux or Windows system, use the TX2's IP address as argument. On the TX2 the throughput is logged to the console. As you can see, it drops repeatedly. The TX2 operates in Max-N mode. We ran jetson_clocks.sh to disable CPU throttling.

In a pcap dump we frequently see duplicate TCP ACKs or TCP retransmissions. As this is an indicator of a problem on the data layer, we exchanged every hardware component involved (TX2, sender, switch). We even built a second setup (on a different continent) and see the same issue. So we are sure it's not a problem with our hardware setup. We are testing with the developer board and with a J140 from Auvidea. No difference.

We found a thread on the kernel mailing list (https://patchwork.kernel.org/patch/9411213/) which describes a similar problem with a Synopsys IP core. There, disabling Energy Efficient Ethernet (EEE) solved the problem. But ethtool tells me EEE isn't available in the Synopsys driver for the TX2.

So question: do you have any idea what could cause these drops?

Thanks!

robert4f5p3,

I get a connection error with your sample URL. Could you share it as an attachment?

To be honest I cannot see how to attach ;-) So I copy & paste the whole code block:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <float.h>
#include <errno.h>  

#if defined (_MSC_VER )
#   include <WinSock2.h>
#   include <Ws2tcpip.h>
#   pragma comment(lib, "ws2_32.lib")
#   define INETPTON InetPton
#else
#   include <unistd.h>
#   include <sys/socket.h>
#   include <arpa/inet.h>
#   include <netinet/tcp.h>
#   define INETPTON inet_pton
#endif

#define SERVER_PORT             36547
#define DATA_BUFFER_SIZE        (1024 * 1024 * 4)         /* size of data content buffer that will be sent repeatedly */
#define REPLY_BUFFER_SIZE       (1024)
#define SOCKET_BUFFER_SIZE      (512 * 1024)

int dorecv(int socket_desc, char* buffer, int bufSize);
int dosend(int socket_desc, char* buffer, int bufSize);
int runServer();
int runClient(const char* serverIp);
double tick_sec();
double currentLoad();

int main(int argc , char *argv[])
{

#if defined (_MSC_VER )
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 0), &wsa))
    {
        printf("WSAStartup error\n");
        return -1;
    }
#endif

    if (argc > 1)
    {
        printf("Run as client\n");
        runClient(argv[1]);
    }
    else
    {
        printf("Run as server\n");
        runServer();
    }

    printf("Program ends!\n");

    return 0;
}

int dorecv(int socket_desc, char* buffer, int bufSize)
{
    int bytesRead = 0;

    while (bytesRead < bufSize)
    {
        int r = recv(socket_desc , buffer + bytesRead, bufSize - bytesRead, 0);

        if (r <= 0)
        {
            printf("Could not receive (%s)\n", strerror(errno));
            return -1;
        }

        bytesRead += r;
    }

    return 0;
}

int dosend(int socket_desc, char* buffer, int bufSize)
{
    int written = 0, w;

    while (written < bufSize)
    {
        if ((w = send(socket_desc , buffer + written, bufSize - written, 0)) < 0)
        {
            printf("Could not send (%s)\n", strerror(errno));
            return -1;
        }

        written += w;
    }

    return 0;
}

double tick_sec()
{
#if defined (_MSC_VER )

    static LARGE_INTEGER frequency;
    LARGE_INTEGER now;

    if (frequency.QuadPart == 0)
    {
        QueryPerformanceFrequency(&frequency);
    }  

    QueryPerformanceCounter(&now);
    return now.QuadPart / (double)frequency.QuadPart;

#else
    struct timespec tp; 
    
    if (clock_gettime(CLOCK_MONOTONIC, &tp) != 0)
    {
        return 0;
    }

    return (double)tp.tv_sec + (double)tp.tv_nsec / 1000000000.0; 
#endif
}

int runServer()
{
    int sockt, sockfd, c;
    struct sockaddr_in server, client;
    socklen_t optLen = sizeof(int);
    char* buffer = malloc(DATA_BUFFER_SIZE);
    int sz = SOCKET_BUFFER_SIZE;
    double minVal = DBL_MAX, maxVal = -DBL_MAX;  /* DBL_MIN is the smallest positive double, not the most negative */
    int slowCnt = 0;

    if (buffer == NULL)
    {
        printf("Could not allocate memory\n");
        return -1;
    }

    //Create socket
    sockt = socket(AF_INET , SOCK_STREAM , 0);
    if (sockt == -1)
    {
        printf("Could not create socket\n");
        return -1;
    }
     
    //Prepare the sockaddr_in structure
    server.sin_family = AF_INET;
    server.sin_addr.s_addr = INADDR_ANY;
    server.sin_port = htons( SERVER_PORT );

    //Bind
    if( bind(sockt, (struct sockaddr *)&server , sizeof(server)) < 0)
    {
        //print the error message
        printf("bind failed. Error\n");
        return 1;
    }

    //Listen
    listen(sockt , 1);
     
    //Accept an incoming connection
    printf("Waiting for incoming connections...\n");

    //accept connection from an incoming client
    c = sizeof(struct sockaddr_in);
    sockfd = accept(sockt, (struct sockaddr *)&client, (socklen_t*)&c);
    if (sockfd < 0)
    {
        printf("accept failed\n");
        return -1;
    }
    
    printf("Connection accepted, start receiving\n"); 

    /* Note: for TCP window scaling the receive buffer should ideally be set
       on the listening socket before accept(); set here it may have limited effect. */
    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, (char*)&sz, optLen) == -1)
    {
        printf("Could not set receive buffer size\n");
        return -1; 
    }

    sz = 1;
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, (char*)&sz, optLen))
    {
        printf("Could not set TCP_NODELAY\n");
        return -1; 
    }

    //Receive a message from client
    while (1)
    {
        double startTime, endTime, val;

        startTime = tick_sec();
        if (dorecv(sockfd, buffer, DATA_BUFFER_SIZE))
        {
            printf("Stop server loop\n");
            return -1;
        }

        endTime = tick_sec();
        val = ((double)DATA_BUFFER_SIZE / 1048576.0) / (endTime - startTime);

        if (val < minVal) minVal = val;
        if (val > maxVal) maxVal = val;  /* independent checks: "else if" would miss the first sample */

        printf("%.2lf\n", val);
        //printf("%.2lf\t%.2lf\t%.2lf\t%.2lf\n", val, currentLoad(), minVal, maxVal);
        //if (val < 100) printf("%.2lf (%i)\n", val, ++slowCnt);

        if (dosend(sockfd, buffer, REPLY_BUFFER_SIZE))
        {
            printf("Stop server loop\n");
            return -1;
        }
    }
     
    printf("\n");

    return 0;
}

int runClient(const char* serverIp)
{
    int sockfd = 0;
    char* buffer = (char*)malloc(DATA_BUFFER_SIZE);
    struct sockaddr_in serv_addr; 
    unsigned int iter = 0;
    int sz = SOCKET_BUFFER_SIZE;
    int optLen = sizeof(int);

    setvbuf(stdout, NULL, _IONBF, 0);
    setvbuf(stderr, NULL, _IONBF, 0);

    if(serverIp == NULL)
    {
        printf("No server ip given!\n");
        return 1;
    }

    if (buffer == NULL)
    {
        printf("Could not allocate memory\n");
        return -1;
    }

    memset(buffer, '0', DATA_BUFFER_SIZE);  /* sizeof(buffer) would only be the pointer size */

    if((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
    {
        printf("Could not create socket\n");
        return -1;
    } 

    memset(&serv_addr, 0, sizeof(serv_addr));  /* zero-fill, not the character '0' */

    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(SERVER_PORT); 

    if(INETPTON(AF_INET, serverIp, &serv_addr.sin_addr)<=0)
    {
        printf("inet_pton error occurred\n");
        return 1;
    }

    if(connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0)
    {
       printf("Connect failed\n");
       return -1;
    }

    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, (char*)&sz, optLen) == -1)
    {
        printf("Could not set send buffer size\n");
        return -1; 
    }

    printf("\nStart send\n");

    while (1)
    {
        if (dosend(sockfd, buffer, DATA_BUFFER_SIZE))
        {
            printf("Stop client loop\n");
            return -1;
        }

        if ((iter % 50) == 0)
        {
            printf("\riter %u", iter + 1);
        }

        iter++;

        if (dorecv(sockfd, buffer, REPLY_BUFFER_SIZE))
        {
            printf("Stop client loop\n");
            return -1;
        }
    }

    printf("\n");

    return 0;
}

double currentLoad()
{
    double usage;

#if defined (_MSC_VER)
    usage = 0;
#else
    double sum = 0, idle = 0;
    double vals[4];
    static double lastSum = 0, lastIdle = 0;
    static int isInit = 0;
    FILE* file = NULL;

    file = fopen("/proc/stat", "r");
    if (file == NULL)
    {
        return 0;
    }
    
    if (fscanf(file, "%*s %lf %lf %lf %lf", &vals[0], &vals[1], &vals[2], &vals[3]) != 4)
    {
        fclose(file);
        return 0;
    }

    idle = vals[3];
    sum = vals[0] + vals[1] + vals[2] + vals[3];

    if (isInit)
    {
        usage = (1.0 - (idle - lastIdle) / (sum - lastSum)) * 100.0;
    }
    else
    {
        usage = 0;
        isInit = 1;
    }

    lastSum = sum;
    lastIdle = idle;      
    
    fclose(file);
#endif
    return usage;
}

Hi robert4f5p3,

Thanks for sharing.

You can see the paper clip button in the upper right once you have posted a comment. It is also fine to share it as a code block. Does this require a makefile? Just want to make sure I didn't miss anything. (Though it looks like it uses only the standard library.)

No makefile needed. Just call gcc or cl (cl from the Visual Studio Developer Command Prompt).
testcode.c (7.68 KB)

robert4f5p3,

I forgot to ask what is your BSP? Are you using rel-28.1 or rel-28.2?

Tried your sample and below is my log.

10.98 | 111.10 | 110.10 | 109.66 | 111.06 | 111.05 | 71.56 | 67.10 | 85.19 | 110.99 | 110.44 | 110.96 | 44.86 | 72.04 | 110.67.

I think those lower values are what you pointed out. Could you reproduce this issue using iperf3?

Just want to prove this in a formal way, because most of our throughput tests are done with iperf3.

robert4f5p3,

Have you tried with iperf3?

I can see this with iperf3, too:

Server is running on the TX2. Client output:

iperf3 -p 3001 -c 10.0.0.160  -i 0.1
[  4]   1.20-1.30   sec  11.4 MBytes   952 Mbits/sec
[  4]   1.30-1.40   sec  11.2 MBytes   944 Mbits/sec
[  4]   1.40-1.50   sec  11.4 MBytes   955 Mbits/sec
[  4]   1.50-1.60   sec  10.2 MBytes   858 Mbits/sec
[  4]   1.60-1.70   sec  8.75 MBytes   736 Mbits/sec
[  4]   1.70-1.80   sec  10.9 MBytes   909 Mbits/sec
[  4]   1.80-1.90   sec  11.2 MBytes   941 Mbits/sec
[  4]   1.90-2.00   sec  11.4 MBytes   955 Mbits/sec
[  4]   2.00-2.10   sec  10.4 MBytes   871 Mbits/sec
[  4]   2.10-2.20   sec  10.0 MBytes   839 Mbits/sec
[  4]   2.20-2.30   sec  11.4 MBytes   958 Mbits/sec

[ 4] 1.60-1.70 sec 8.75 MBytes 736 Mbits/sec -> Do you mean this one?

Could you try this with the -l 16K param? (The default length is 8K.)

iperf3 -c <server> -u -b 0 -l 16K -t 120 -i 1

Hey,

will do. Currently we are trying to improve performance via sysctl settings. I will report any findings.

Thanks!

Ok, here are the results.

  • The first big improvement comes with increased buffer size limits (net.core.rmem_max/net.core.wmem_max). Of course one has to actually request these bigger buffers with SO_RCVBUF/SO_SNDBUF.
  • We found that disabling Generic Receive Offload (GRO) greatly improved performance when receiving from one specific device (not a TX2).

That’s all I can tell.

robert4f5p3,

I am glad you have found a solution. But to be honest, what concerns me is whether this issue can be reproduced on the Jetson TX2 and how to reproduce it.

Could you share more details of your experiment? Is your experiment using iperf as the test app?

You mention “SO_RCVBUF/SO_SNDBUF”, so are you still doing your own socket programming?

Maybe the unstable iperf results can be resolved by setting net.core.rmem_max/net.core.wmem_max.

Please kindly share more. Thanks!

See this comment for the source:
https://devtalk.nvidia.com/default/topic/1030812/jetson-tx2/port-ethernet-performance/post/5247798/#5247798

I still use sockets, yes. I assume iperf3 does nothing different ;-)
Actually it’s no surprise that increasing the socket buffer sizes fixes the problem.

Bye