Slow memmove on Jetson Nano

Hi, we have a Jetson Nano module with 4GB of RAM and 16GB of eMMC storage.
The following scenario was tested with L4T 32.5.1 and 32.7.1, both producing the same result.

I have a very simple Go program that allocates 100MB of memory, then creates a second, empty slice and appends those 100MB to it byte by byte.
As expected, this causes lots of memory allocations, since the slice grows dynamically and must be copied to a new memory segment each time it runs out of capacity.
What I did not expect is the absurd execution time of ~25 seconds.

The same code runs fine on a Xavier AGX, needing only 88ms to finish there.
EDIT: Tested it on a Xavier NX as well, 404ms.

Can somebody help me debug this or tell me whether this is expected or not?

Here is the code (run with go run main.go); I am using go version go1.18 linux/arm64.

package main

import (
        "fmt"
        "time"
)

func main() {
        // Source slice: 100MB of zero bytes.
        b1 := make([]byte, 1024*1024*100)
        // Destination slice starts empty and grows via append.
        b2 := make([]byte, 0)

        start := time.Now()
        for _, v := range b1 {
                b2 = append(b2, v + 1)
        }
        fmt.Printf("finished after %s\n", time.Since(start).String())
}
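
For comparison, here is a variant that preallocates the destination capacity, so append should never have to reallocate and copy. This is just an untested sketch meant to isolate the reallocation cost:

package main

import (
        "fmt"
        "time"
)

func main() {
        b1 := make([]byte, 1024*1024*100)
        // Give b2 the full capacity up front so append never grows the backing array.
        b2 := make([]byte, 0, len(b1))

        start := time.Now()
        for _, v := range b1 {
                b2 = append(b2, v + 1)
        }
        fmt.Printf("finished after %s\n", time.Since(start).String())
}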

Hi,
Please execute sudo nvpmodel -m 0 and sudo jetson_clocks, and try again. See if there is improvement when running all CPU cores at the maximum clock.

Hi, thanks for looking into my problem.
I did that and reran the program, but it did not change much.
The execution time varies a lot; I have now seen everything from 13 to 40 seconds.

I have tested three different Jetson Nano modules; each behaves the same.
I also tested a TX2 NX module, which needed anywhere between 8 and 16 seconds.

All tests were run in the highest performance mode.

I wrote a small C++ code sample that is hopefully equivalent to the Go program:

// Rough C++ equivalent of the Go test: copy 100MB element by element into a growing vector.

#include <iostream>
#include <chrono>
#include <vector>

int main() {
    // Source vector: 100MB of zero-initialized chars.
    std::vector<char> v(1024*1024*100);
    // Destination vector starts empty and reallocates as it grows.
    std::vector<char> v2;

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (const auto a : v) {
        v2.push_back(a);
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Needed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << "[ms]" << std::endl;
    return 0;
}

It needs 6.2s, and that is pretty consistent between runs. That is very different from the Go example.

Hi,
Please run sudo tegrastats and check whether the CPU load is maxed out and capping the performance.

Hi,
This is the output of one run; two CPU cores are fully loaded:

RAM 693/3956MB (lfb 623x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [13%@1036,28%@1036,16%@1036,12%@1036] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@38C CPU@41.5C PMIC@100C GPU@38C AO@44C thermal@40.75C POM_5V_IN 4055/4055 POM_5V_GPU 81/81 POM_5V_CPU 1187/1187
RAM 711/3956MB (lfb 618x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [7%@1479,85%@1479,9%@1479,77%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@39.5C CPU@44C PMIC@100C GPU@41.5C AO@44.5C thermal@42.75C POM_5V_IN 3686/3870 POM_5V_GPU 0/40 POM_5V_CPU 1392/1289
RAM 715/3956MB (lfb 617x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [4%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@44.5C PMIC@100C GPU@42C AO@45.5C thermal@43.25C POM_5V_IN 3686/3809 POM_5V_GPU 0/27 POM_5V_CPU 1351/1310
RAM 721/3956MB (lfb 615x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@45C PMIC@100C GPU@42C AO@45.5C thermal@43.25C POM_5V_IN 3686/3778 POM_5V_GPU 0/20 POM_5V_CPU 1351/1320
RAM 723/3956MB (lfb 615x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40.5C CPU@45C PMIC@100C GPU@42.5C AO@46C thermal@43.5C POM_5V_IN 3686/3759 POM_5V_GPU 0/16 POM_5V_CPU 1351/1326
RAM 727/3956MB (lfb 614x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,1%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@40.5C CPU@45.5C PMIC@100C GPU@42.5C AO@46C thermal@44C POM_5V_IN 3686/3747 POM_5V_GPU 0/13 POM_5V_CPU 1351/1330
RAM 731/3956MB (lfb 613x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [3%@1479,100%@1479,1%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41C CPU@45.5C PMIC@100C GPU@43C AO@46.5C thermal@44C POM_5V_IN 3686/3738 POM_5V_GPU 0/11 POM_5V_CPU 1351/1333
RAM 737/3956MB (lfb 611x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41C CPU@45.5C PMIC@100C GPU@43C AO@46.5C thermal@44.25C POM_5V_IN 3727/3737 POM_5V_GPU 0/10 POM_5V_CPU 1392/1340
RAM 741/3956MB (lfb 610x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41.5C CPU@45.5C PMIC@100C GPU@43C AO@46.5C thermal@44.5C POM_5V_IN 3686/3731 POM_5V_GPU 0/9 POM_5V_CPU 1351/1341
RAM 745/3956MB (lfb 609x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41C CPU@46C PMIC@100C GPU@43C AO@47C thermal@44.5C POM_5V_IN 3727/3731 POM_5V_GPU 0/8 POM_5V_CPU 1392/1346
RAM 749/3956MB (lfb 608x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41.5C CPU@46C PMIC@100C GPU@43.5C AO@47C thermal@44.75C POM_5V_IN 3727/3730 POM_5V_GPU 0/7 POM_5V_CPU 1351/1347
RAM 753/3956MB (lfb 607x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,4%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@41.5C CPU@46.5C PMIC@100C GPU@43.5C AO@47.5C thermal@45C POM_5V_IN 3686/3727 POM_5V_GPU 0/6 POM_5V_CPU 1392/1351
RAM 757/3956MB (lfb 606x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [4%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@46.5C PMIC@100C GPU@44C AO@47.5C thermal@45C POM_5V_IN 3727/3727 POM_5V_GPU 0/6 POM_5V_CPU 1392/1354
RAM 763/3956MB (lfb 605x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@46.5C PMIC@100C GPU@44C AO@47.5C thermal@45C POM_5V_IN 3727/3727 POM_5V_GPU 0/5 POM_5V_CPU 1392/1356
RAM 767/3956MB (lfb 604x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@47.5C thermal@45.5C POM_5V_IN 3727/3727 POM_5V_GPU 0/5 POM_5V_CPU 1351/1356
RAM 773/3956MB (lfb 602x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [4%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@47C thermal@45.25C POM_5V_IN 3727/3727 POM_5V_GPU 0/5 POM_5V_CPU 1392/1358
RAM 777/3956MB (lfb 601x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,1%@1479,100%@1479] EMC_FREQ 7%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@47.5C thermal@45.75C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1360
RAM 795/3956MB (lfb 597x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@47C PMIC@100C GPU@44C AO@48C thermal@45.5C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1362
RAM 799/3956MB (lfb 596x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [2%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42.5C CPU@47C PMIC@100C GPU@44.5C AO@48C thermal@45.25C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1363
RAM 807/3956MB (lfb 594x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1479,100%@1479,0%@1479,100%@1479] EMC_FREQ 8%@1600 GR3D_FREQ 0%@76 VIC_FREQ 0%@192 APE 25 PLL@42.5C CPU@47C PMIC@100C GPU@44.5C AO@48C thermal@45.5C POM_5V_IN 3727/3727 POM_5V_GPU 0/4 POM_5V_CPU 1392/1365

Hi,
This looks to be constrained by CPU capability. Two CPU cores are at 100% load while executing the test app. If you need this function, we would suggest implementing CUDA code to do the same thing so that the load can be shifted to the GPU.

The program is of course only a showcase. We discovered this problem when loading a ~100MB model file from disk into RAM (sketched below). As I said, on the Nano this takes up to 40s. Why are simple memory allocations so CPU-intensive?

The same program runs on an ODROID XU4 (arm/v7) in 2.5s.
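
For reference, the real file load in our application looks roughly like the sketch below (model.bin is just a placeholder path). os.ReadFile stats the file and allocates the destination buffer once at the file size, so the copy from eMMC into RAM should not involve repeated slice growth:

package main

import (
        "fmt"
        "os"
        "time"
)

func main() {
        start := time.Now()
        // os.ReadFile allocates the buffer once based on the file size,
        // so reading should not trigger repeated reallocations.
        data, err := os.ReadFile("model.bin") // placeholder path for the ~100MB model
        if err != nil {
                panic(err)
        }
        fmt.Printf("read %d bytes in %s\n", len(data), time.Since(start).String())
}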

Hi,
We tried the test code on Jetson Nano, TX2 NX, and Xavier NX:

Jetson Nano: Needed: 6273[ms]
TX2 NX: Needed: 4595[ms]
Xavier NX: Needed: 2070[ms]

It looks like the performance is dominated by the CPU cache. The TX2 NX can run at up to 2GHz, but its performance is worse than that of the Xavier NX. From your comments, it looks like the CPU cache on the ODROID XU4 is better than on the Jetson Nano.

Hi,
Interesting; the Xavier NX result surprised me, given that the Go example executed in 404ms.
I reran the C++ example on my Xavier NX and got 1505ms.
That is kind of unexpected to me, given that the Go sample performed so much worse than the C++ example on both the Nano and the TX2.

Is there anything else we could try? Or is the only solution to ditch the Nano and TX2 and go back to the Xavier NX?
Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.