Slow memmove on Jetson Nano

Hi, we have a Jetson Nano module with 4GB of RAM and 16GB of eMMC storage.
The following scenario was tested with L4T 32.5.1 and 32.7.1, both producing the same result.

I have a very simple Go program that allocates 100MB of memory, then creates a second, empty slice and appends those 100MB to it byte by byte.
As expected, this causes lots of memory allocations, since the slice grows dynamically and must be copied to a new memory segment each time it runs out of capacity.
What I did not expect is the absurd execution time of ~25 seconds.

The same code runs fine on a Xavier AGX, needing only 88ms to finish there.
EDIT: Tested it on a Xavier NX as well, 404ms.

Can somebody help me debug this or tell me whether this is expected or not?

Here is the code (run with go run main.go); I am using go version go1.18 linux/arm64.

package main

import (
        "fmt"
        "time"
)

func main() {
        // Source slice: 100MB of zero bytes.
        b1 := make([]byte, 1024*1024*100)
        // Destination slice starts empty and grows via append.
        b2 := make([]byte, 0)

        start := time.Now()
        for _, v := range b1 {
                b2 = append(b2, v + 1)
        }
        fmt.Printf("finished after %s\n", time.Since(start).String())
}
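
For comparison, here is a variant that preallocates the destination capacity, so append should never have to reallocate and copy. This is just an untested sketch meant to isolate the reallocation cost:

package main

import (
        "fmt"
        "time"
)

func main() {
        b1 := make([]byte, 1024*1024*100)
        // Give b2 the full capacity up front so append never grows the backing array.
        b2 := make([]byte, 0, len(b1))

        start := time.Now()
        for _, v := range b1 {
                b2 = append(b2, v + 1)
        }
        fmt.Printf("finished after %s\n", time.Since(start).String())
}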

Hi,
Please execute sudo nvpmodel -m 0 and sudo jetson_clocks, and try again. See if there is improvement when running all CPU cores at the maximum clock.

Hi, thanks for looking into my problem.
I did that and reran the program, but it did not change much.
The execution time varies a lot; I have now seen everything from 13 to 40 seconds.

I have tested three different Jetson Nano modules; each behaves the same.
I also tested a TX2 NX module, which needed anywhere between 8 and 16 seconds.

All tests were run in the highest performance mode.

I wrote a small C++ code sample that is hopefully equivalent to the Go program:

// Rough C++ equivalent of the Go test: copy 100MB element by element into a growing vector.

#include <iostream>
#include <chrono>
#include <vector>

int main() {
    // Source vector: 100MB of zero-initialized chars.
    std::vector<char> v(1024*1024*100);
    // Destination vector starts empty and reallocates as it grows.
    std::vector<char> v2;

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (const auto a : v) {
        v2.push_back(a);
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Needed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << "[ms]" << std::endl;
    return 0;
}

It needs 6.2s, and that is pretty consistent between runs. That is very different from the Go example.

Hi,
Please run sudo tegrastats and check whether the CPU load is maxed out and capping the performance.

Hi,
This is the output of one run; two CPU cores are fully loaded:

RAM 693/3956MB (lfb 623x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [13%@1036,28%@1036,16%@1036,12%@1036] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@38C CPU@41.5C PMIC@100C GPU@38C AO@44C thermal@40.75C POM_5V_IN 4055/4055 POM_5V_GPU 81/81 POM_5V_CPU 1187/1187
RAM 711/3956MB (lfb 618x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [7%@1479,85%@1479,9%@1479,77%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@39.5C CPU@44C PMIC@100C GPU@41.5C AO@44.5C thermal@42.75C POM_5V_IN 3686/3870 POM_5V_GPU 0/40 POM_5V_CPU 1392/1289
RAM 715/3956MB (lfb 617x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [4%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@44.5C PMIC@100C GPU@42C AO@45.5C thermal@43.25C POM_5V_IN 3686/3809 POM_5V_GPU 0/27 POM_5V_CPU 1351/1310
RAM 721/3956MB (lfb 615x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@45C PMIC@100C GPU@42C AO@45.5C thermal@43.25C POM_5V_IN 3686/3778 POM_5V_GPU 0/20 POM_5V_CPU 1351/1320
RAM 723/3956MB (lfb 615x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40.5C CPU@45C PMIC@100C GPU@42.5C AO@46C thermal@43.5C POM_5V_IN 3686/3759 POM_5V_GPU 0/16 POM_5V_CPU 1351/1326
RAM 727/3956MB (lfb 614x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,1%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40.5C CPU@45.5C PMIC@100C GPU@42.5C AO@46C thermal@44C POM_5V_IN 3686/3747 POM_5V_GPU 0/13 POM_5V_CPU 1351/1330
RAM 731/3956MB (lfb 613x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [3%@1479,100%@1479,1%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41C CPU@45.5C PMIC@100C GPU@43C AO@46.5C thermal@44C POM_5V_IN 3686/3738 POM_5V_GPU 0/11 POM_5V_CPU 1351/1333
RAM 737/3956MB (lfb 611x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41C CPU@45.5C PMIC@100C GPU@43C AO@46.5C thermal@44.25C POM_5V_IN 3727/3737 POM_5V_GPU 0/10 POM_5V_CPU 1392/1340
RAM 741/3956MB (lfb 610x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41.5C CPU@45.5C PMIC@100C GPU@43C AO@46.5C thermal@44.5C POM_5V_IN 3686/3731 POM_5V_GPU 0/9 POM_5V_CPU 1351/1341
RAM 745/3956MB (lfb 609x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41C CPU@46C PMIC@100C GPU@43C AO@47C thermal@44.5C POM_5V_IN 3727/3731 POM_5V_GPU 0/8 POM_5V_CPU 1392/1346
RAM 749/3956MB (lfb 608x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41.5C CPU@46C PMIC@100C GPU@43.5C AO@47C thermal@44.75C POM_5V_IN 3727/3730 POM_5V_GPU 0/7 POM_5V_CPU 1351/1347
RAM 753/3956MB (lfb 607x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,4%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41.5C CPU@46.5C PMIC@100C GPU@43.5C AO@47.5C thermal@45C POM_5V_IN 3686/3727 POM_5V_GPU 0/6 POM_5V_CPU 1392/1351
RAM 757/3956MB (lfb 606x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [4%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@46.5C PMIC@100C GPU@44C AO@47.5C thermal@45C POM_5V_IN 3727/3727 POM_5V_GPU 0/6 POM_5V_CPU 1392/1354
RAM 763/3956MB (lfb 605x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@46.5C PMIC@100C GPU@44C AO@47.5C thermal@45C POM_5V_IN 3727/3727 POM_5V_GPU 0/5 POM_5V_CPU 1392/1356
RAM 767/3956MB (lfb 604x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@47.5C thermal@45.5C POM_5V_IN 3727/3727 POM_5V_GPU 0/5 POM_5V_CPU 1351/1356
RAM 773/3956MB (lfb 602x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [4%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@47C thermal@45.25C POM_5V_IN 3727/3727 POM_5V_GPU 0/5 POM_5V_CPU 1392/1358
RAM 777/3956MB (lfb 601x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,1%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@47.5C thermal@45.75C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1360
RAM 795/3956MB (lfb 597x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@48C thermal@45.5C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1362
RAM 799/3956MB (lfb 596x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42.5C CPU@47C PMIC@100C GPU@44.5C AO@48C thermal@45.25C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1363
RAM 807/3956MB (lfb 594x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42.5C CPU@47C PMIC@100C GPU@44.5C AO@48C thermal@45.5C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1365

Hi,
This looks to be constrained by CPU capability. Two CPU cores are at 100% load while executing the test app. If you need this function, we would suggest implementing CUDA code to do the same thing so that the load can be shifted to the GPU.

The program is of course only a showcase. We discovered this problem when loading a ~100MB model file from disk into RAM (sketched below). As I said, on the Nano this takes up to 40s. Why are simple memory allocations so CPU-intensive?

The same program runs on an ODROID XU4 (arm/v7) in 2.5s.
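
For reference, the real file load in our application looks roughly like the sketch below (model.bin is just a placeholder path). os.ReadFile stats the file and allocates the destination buffer once at the file size, so the copy from eMMC into RAM should not involve repeated slice growth:

package main

import (
        "fmt"
        "os"
        "time"
)

func main() {
        start := time.Now()
        // os.ReadFile allocates the buffer once based on the file size,
        // so reading should not trigger repeated reallocations.
        data, err := os.ReadFile("model.bin") // placeholder path for the ~100MB model
        if err != nil {
                panic(err)
        }
        fmt.Printf("read %d bytes in %s\n", len(data), time.Since(start).String())
}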

Hi,
We tried the test code on Jetson Nano, TX2 NX, and Xavier NX:

Jetson Nano: Needed: 6273[ms]
TX2 NX: Needed: 4595[ms]
Xavier NX: Needed: 2070[ms]

It looks like the performance is dominated by the CPU cache. The TX2 NX can run at up to 2GHz, but its performance is worse than that of the Xavier NX. From your comments, it looks like the CPU cache on the ODROID XU4 is better than on the Jetson Nano.

Hi,
Interesting; the Xavier NX result surprised me, given that the Go example executed in 404ms.
I reran the C++ example on my Xavier NX and got 1505ms.
That is kind of unexpected to me, given that the Go sample performed so much worse than the C++ example on both the Nano and the TX2.

Is there anything else we could try? Or is the only solution to ditch the Nano and TX2 and go back to the Xavier NX?
Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.