Docker Engine Dependencies JetPack 5.1

Hello everyone!
This is my first post here, nice to meet you all.

My question today is about Docker Engine installed on an NVIDIA Xavier NX running JetPack 5.1 on an Airvolute carrier board.

We do a lot of development with Docker images and containers, and we have an issue with it.
When the Jetson starts, we want the Docker containers to start with the OS. Usually this works fine.
When we start the Jetson without an Ethernet cable in port eth0, the Docker engine only loads after 4-5 minutes. Even if I connect the cable to port eth1 through an external device, the Docker engine still loads with the delay.
I tried to work around this by editing docker.service in /lib/systemd/system and disabling network-online.target in the "After" and "Wants" entries. I also tried variations (only the "After" entry, or only the "Wants" entry).
After I changed this file, the containers kept restarting after reboot and never became stable.

So what solution am I looking for?
I want to be able to remove the eth0 wait, or whatever docker.service depends on that causes it.
I want the Docker engine to start right after the device finishes booting, with all containers stable and running.
I want a real solution and not a temporary workaround, because this is going into our product line.

Thank you all!

Hi,

I assume the container already exists in your environment, so no downloading is needed. Is that correct?

Which command do you use for launching?
We would like to reproduce this locally and take a closer look.

Thanks.

That's the script for the docker run command:

#!/usr/bin/env bash

mack_md5_hash=$(ip link show | awk '/ether/ {print $2}' | head -1 | md5sum | awk '{print $1}')
max_ros_domainId=229
ros_domainId=$((16#${mack_md5_hash} % ${max_ros_domainId}))
[[ ${ros_domainId} -lt 0 ]] && ros_domainId=$((${ros_domainId} + ${max_ros_domainId}))

docker run \
  --name {container-name} \
  --device=/dev/ttyTHS0:/dev/ttyTHS0 \
  --device=/dev/spidev0.0:/dev/spidev0.0 \
  --net=host \
  --ipc=host \
  --gpus all \
  --runtime nvidia \
  --publish-all \
  -v $(realpath $(dirname $(realpath "$0"))/…/…/):/{Path} \
  -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
  -v /tmp/argus_socket:/tmp/argus_socket \
  --cap-add SYS_PTRACE \
  -e RMW_IMPLEMENTATION=rmw_fastrtps_cpp \
  -e ROS_DOMAIN_ID=${ros_domainId} \
  -e PYTHONOPTIMIZE=2 \
  -it \
  -d \
  --restart unless-stopped \
  image-name:latest \
  /{Path}/script.sh
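
The ROS_DOMAIN_ID step in the script feeds a 32-digit MD5 hex string into bash arithmetic. Bash integers are 64-bit signed, so the value appears to wrap and can come out negative, which is presumably why the script corrects a negative remainder. A minimal sketch of a variant that stays positive by using only the first 15 hex digits is shown here. `ros_domain_id_from_mac` and the example MAC address are made up for illustration, and the IDs it yields will differ from the ones the original script produces.

```bash
#!/usr/bin/env bash
# Sketch: derive a ROS_DOMAIN_ID in [0, max) from a MAC address string.
# ros_domain_id_from_mac is a hypothetical helper, not part of the original script.
ros_domain_id_from_mac() {
  local mac="$1" max="${2:-229}" hash
  # Same hashing step as the script: md5 of the MAC followed by a newline.
  hash=$(printf '%s\n' "$mac" | md5sum | awk '{print $1}')
  # Use only the first 15 hex digits (60 bits) so the value stays positive
  # in bash's signed 64-bit arithmetic.
  echo $(( 16#${hash:0:15} % max ))
}

# Example with a made-up MAC address; prints a number between 0 and 228.
ros_domain_id_from_mac "00:11:22:33:44:55"
```

On the device, the MAC could be passed in the same way the script already reads it: `ros_domain_id_from_mac "$(ip link show | awk '/ether/ {print $2}' | head -1)"`.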

.
.

I just noticed something: I flashed another Jetson from scratch (which takes a lot of time), and docker.service loaded very quickly without the eth0 port connected.
I normally flash devices with the l4t_backup_and_restore scripts: I create a golden image and deploy it to all the other devices, because flashing the traditional way takes up to three hours, while l4t_backup_and_restore takes about 35 minutes.

Is there any reason why docker.service behaves like this?

It just happened again. I took a clean Jetson and set it up the old way (flashing with the scripts, then installing Docker, CUDA and TensorRT with NVIDIA SDK Manager), and everything worked perfectly.
Then I took this device, used the backup option of the l4t_backup_restore.sh script, and restored the image to another device. All the data was transferred, but docker.service loads with the delay again.

systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: inactive (dead)
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com

.
.

cat /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket
Wants=containerd.service

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process
OOMScoreAdjust=-500

[Install]
WantedBy=multi-user.target
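
The only lines in this unit that tie Docker to the network are After=network-online.target and Wants=network-online.target. systemd drop-in files can add dependencies but cannot remove them, so the edit described in the first post needs a full copy of the unit; a copy under /etc/systemd/system takes precedence over the packaged one in /lib/systemd/system. A minimal sketch of that edit as a stdin-to-stdout filter follows, where `strip_network_online` is a hypothetical helper name. It only reproduces the change already tried in this thread (which left the containers restarting), so treat it as a way to experiment, not as a confirmed fix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a unit file on stdin and print it with
# network-online.target removed from After= and Wants= lines.
strip_network_online() {
  awk '
    /^(After|Wants)=/ {
      key = substr($0, 1, index($0, "="))        # "After=" or "Wants="
      n = split(substr($0, length(key) + 1), units, /[ \t]+/)
      out = ""
      for (i = 1; i <= n; i++)
        if (units[i] != "" && units[i] != "network-online.target")
          out = out (out == "" ? "" : " ") units[i]
      if (out != "") print key out               # drop the line if nothing is left
      next
    }
    { print }
  '
}

# Example on the dependency lines from the unit above.
strip_network_online <<'EOF'
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket
EOF
```

On the device this could be applied with `strip_network_online < /lib/systemd/system/docker.service | sudo tee /etc/systemd/system/docker.service` followed by `sudo systemctl daemon-reload`. Running `systemd-analyze critical-chain docker.service` before and after shows which units docker.service actually waited for during boot.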
.
.
.
.

After 3-4 minutes

.
.
systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2024-02-26 13:09:11 UTC; 8s ago
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com
Main PID: 1981 (dockerd)
Tasks: 15
Memory: 105.8M
CGroup: /system.slice/docker.service
└─1981 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Feb 26 13:09:09 dcs dockerd[1981]: time="2024-02-26T13:09:09.175038215Z" level=info msg="Removing stale sandbox 5ab530eb3fd387ed0d79c9c455c38aa1f180df4f0c8aab16fb36e72c0e8c70a7 (20a5a62c88269fcd4c4a870daf0>
Feb 26 13:09:09 dcs dockerd[1981]: time="2024-02-26T13:09:09.185577663Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 4b4dcd194907162f0f4998d95e3f4>
Feb 26 13:09:09 dcs dockerd[1981]: time="2024-02-26T13:09:09.210214040Z" level=info msg="Removing stale sandbox ff3766f94fb4ad55ad7e17ab76a5711f854e7bdfd49e057c2911bbbc74d144b9 (1e76ad364e19b35d5ccc005d6e1>
Feb 26 13:09:09 dcs dockerd[1981]: time="2024-02-26T13:09:09.215685768Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 4b4dcd194907162f0f4998d95e3f4>
Feb 26 13:09:09 dcs dockerd[1981]: time="2024-02-26T13:09:09.341047506Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a pref>
Feb 26 13:09:11 dcs dockerd[1981]: time="2024-02-26T13:09:11.674037594Z" level=info msg="Loading containers: done."
Feb 26 13:09:11 dcs dockerd[1981]: time="2024-02-26T13:09:11.738783528Z" level=info msg="Docker daemon" commit="24.0.5-0ubuntu1~20.04.1" graphdriver=overlay2 version=24.0.5
Feb 26 13:09:11 dcs dockerd[1981]: time="2024-02-26T13:09:11.741402743Z" level=info msg="Daemon has completed initialization"
Feb 26 13:09:11 dcs systemd[1]: Started Docker Application Container Engine.
Feb 26 13:09:11 dcs dockerd[1981]: time="2024-02-26T13:09:11.871651329Z" level=info msg="API listen on /run/docker.sock"

And of course I almost forgot

docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
20a5a62c8826 image-name:latest “Path” 4 hours ago Up 10 minutes container-name
1e76ad364e19 image-name:latest “Path” 5 hours ago Up 10 minutes container-name

Hi,

So when using l4t_backup_and_restore, Docker starts with a delay when there is no network connection,
but works normally with a network connected.

When flashing with the standard SDK Manager, Docker starts without delay with or without a network connection. Is that correct?

If yes, are there any extra dependencies or updates in the backup image?
Or is it an identical environment to a clean SDK Manager flash?

Thanks.

Thank you for responding,

Yes, you are correct. When I use the backup_and_restore process, Docker starts with the delay when no network interface is connected. When I use the regular flash and install Docker with SDK Manager, Docker starts right up with the OS, with or without network connectivity.
I tried to reinstall Docker with SDK Manager (without deleting it first) and nothing changed.

As far as I know, there are no extra dependencies in the backup image. I brought both Jetsons to the same point, one through the backup_and_restore process and one through the manual process (regular basic flash, installing Docker with SDK Manager, and building the Docker images). My test on both Jetsons was at exactly the same point, with exactly the same software and changes (no OS changes).

I say "regular flash" because I'm not using SDK Manager for the flashing procedure. I'm using a custom board designed and developed by Airvolute, and they supply a flashing script with their add-on custom BSP files.

When I use backup and restore, I first flash the device with the regular flash script provided by Airvolute so that it contains the BSP files, and then I "spill" the backup image onto the device. I use the backup_and_restore script because restoring a device to a ready-to-use state takes about 30 minutes, while the manual procedure (installing Docker and all the custom Docker images, software and changes) takes about 3 hours.

I hope I've answered all your questions. If I missed something, I'll answer ASAP.

Best regards

Hi,

Thanks a lot for the explanation.
We will try to reproduce this internally and then get back to you soon.

Thanks.

Hi,

Sorry, one more question.

With the l4t_backup_and_restore approach, do you observe the delay when a network is connected?
Or does it launch immediately, like the device that was set up with SDK Manager?

Thanks.

Hi,

When using l4t_backup_and_restore, I get the delay when no network device is connected to eth0. Even if I connect one to eth1 I still get the delay; only a connection on eth0 removes it.
Even if I try to repair the Docker engine through SDK Manager, it still starts with the delay. The only time it works perfectly is when I flash the Jetson manually with the custom flash script (provided here Airvolute-Github-repo ) and install all the other software with SDK Manager. But as I said before:

Manual flashing - about 3 hours, and needs a person to install everything
Auto flashing using l4t_backup_and_restore - about 30 minutes, no need for a person to install anything, runs perfectly

Thanks!

Hi,

Could you help to check if there is any error log from the docker daemon?

$ sudo journalctl -fu docker.service

Thanks.

Hey!
Sorry for the delay, it was my weekend.
These are the logs:

Feb 26 11:23:29 dcs dockerd[2028]: time="2024-02-26T11:23:29.778271672Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Feb 26 11:23:30 dcs dockerd[2028]: time="2024-02-26T11:23:30.152972376Z" level=info msg="Loading containers: start."
Feb 26 11:23:30 dcs dockerd[2028]: time="2024-02-26T11:23:30.777955448Z" level=info msg="Removing stale sandbox 952fc816ed79ae24cccded57f711a71e8b047cc92f25a4b732ee8a7cb99986e5 (20a5a62c88269fcd4c4a870daf05c066522e4e6a56226c7d7a09fc5500e9428f)"
Feb 26 11:23:30 dcs dockerd[2028]: time="2024-02-26T11:23:30.789121592Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 4b4dcd194907162f0f4998d95e3f4f1ed93fcf8acdbda82de70f0283d408bf44 e8a4b9de24b573c9319f02777f95250a7d630a0b76cc11e29fc11dc5ca453348], retrying…"
Feb 26 11:23:30 dcs dockerd[2028]: time="2024-02-26T11:23:30.807977944Z" level=info msg="Removing stale sandbox f1d2d13d4aa448102500925629169cf9f2de0798f144d6dad5246bf076e88f66 (1e76ad364e19b35d5ccc005d6e1aadc7839d69ccf16afda20bdb32ad2307ff57)"
Feb 26 11:23:30 dcs dockerd[2028]: time="2024-02-26T11:23:30.814609080Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 4b4dcd194907162f0f4998d95e3f4f1ed93fcf8acdbda82de70f0283d408bf44 29b3a2e2389d45b77c8b4c2dd48d6ab2752aae27ea7286923096b9bb4b2f3a3d], retrying…"
Feb 26 11:23:30 dcs dockerd[2028]: time="2024-02-26T11:23:30.935911704Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Feb 26 11:23:33 dcs dockerd[2028]: time="2024-02-26T11:23:33.307217240Z" level=info msg="Loading containers: done."
Feb 26 11:23:33 dcs dockerd[2028]: time="2024-02-26T11:23:33.367878968Z" level=info msg="Docker daemon" commit="24.0.5-0ubuntu1~20.04.1" graphdriver=overlay2 version=24.0.5
Feb 26 11:23:33 dcs dockerd[2028]: time="2024-02-26T11:23:33.368542840Z" level=info msg="Daemon has completed initialization"
Feb 26 11:23:33 dcs dockerd[2028]: time="2024-02-26T11:23:33.484736536Z" level=info msg="API listen on /run/docker.sock"
Feb 26 11:23:33 dcs systemd[1]: Started Docker Application Container Engine.

I received the logs late because docker.service started late.

.
.
.
These are the logs when I start the Jetson with no delay in docker.service:

-- Logs begin at Tue 2023-11-21 21:10:22 UTC. --
Mar 10 16:07:03 dcs dockerd[1659]: time="2024-03-10T16:07:03.553459488Z" level=info msg="Starting up"
Mar 10 16:07:03 dcs dockerd[1659]: time="2024-03-10T16:07:03.559601888Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Mar 10 16:07:03 dcs dockerd[1659]: time="2024-03-10T16:07:03.686007040Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Mar 10 16:07:04 dcs dockerd[1659]: time="2024-03-10T16:07:04.091487744Z" level=info msg="Loading containers: start."
Mar 10 16:07:36 dcs dockerd[1659]: time="2024-03-10T16:07:36.184854014Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 10 16:07:38 dcs dockerd[1659]: time="2024-03-10T16:07:38.612350814Z" level=info msg="Loading containers: done."
Mar 10 16:07:38 dcs dockerd[1659]: time="2024-03-10T16:07:38.671300286Z" level=info msg="Docker daemon" commit="24.0.5-0ubuntu1~20.04.1" graphdriver=overlay2 version=24.0.5
Mar 10 16:07:38 dcs dockerd[1659]: time="2024-03-10T16:07:38.671919486Z" level=info msg="Daemon has completed initialization"
Mar 10 16:07:38 dcs dockerd[1659]: time="2024-03-10T16:07:38.803849822Z" level=info msg="API listen on /run/docker.sock"
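
To compare boots, it helps to measure how long dockerd itself spends between "Loading containers: start." and "Loading containers: done.". A rough sketch that reads journal lines on stdin is shown here. `loading_containers_seconds` is a hypothetical helper; it assumes straight double quotes around the time= field (as journalctl prints them) and that both lines fall on the same day.

```bash
#!/usr/bin/env bash
# Hypothetical helper: whole seconds between dockerd's
# "Loading containers: start." and "Loading containers: done." lines.
loading_containers_seconds() {
  awk '
    # Convert the HH:MM:SS part of a time="<date>T<time>Z" field to seconds of day.
    function tod(line,   m, t) {
      match(line, /time="[^"]+"/)
      t = substr(line, RSTART + 6, RLENGTH - 7)   # text between the quotes
      split(substr(t, 12, 8), m, ":")             # HH:MM:SS after the date and "T"
      return m[1] * 3600 + m[2] * 60 + m[3]
    }
    /Loading containers: start\./ { start = tod($0) }
    /Loading containers: done\./  { print tod($0) - start; exit }
  '
}

# Example with the two relevant lines from the Mar 10 log; prints 34.
loading_containers_seconds <<'EOF'
Mar 10 16:07:04 dcs dockerd[1659]: time="2024-03-10T16:07:04.091487744Z" level=info msg="Loading containers: start."
Mar 10 16:07:38 dcs dockerd[1659]: time="2024-03-10T16:07:38.612350814Z" level=info msg="Loading containers: done."
EOF
```

On the device the input could come from `sudo journalctl -b -u docker.service | loading_containers_seconds`. The same calculation on the Feb 26 11:23 log gives 3 seconds, which suggests the multi-minute wait happens before dockerd starts logging at all, not inside the container loading phase.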

Again, sorry for the delay.
Thanks!

Hi,

Based on the log, the delay might be related to the sandbox.
It looks similar to the issue below:

We are checking this further and will update you with more info.

Thanks.

Hey,
Thanks for the answer, I appreciate it very much.
But the Docker service still starts with the delay, and the error still shows up in the docker.service log when I run "sudo journalctl -fu docker.service":

$ sudo journalctl -fu docker.service
-- Logs begin at Tue 2023-11-21 21:10:22 UTC. --
Mar 11 08:37:35 dcs dockerd[1982]: time="2024-03-11T08:37:35.657036400Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Mar 11 08:37:36 dcs dockerd[1982]: time="2024-03-11T08:37:36.032509584Z" level=info msg="Loading containers: start."
Mar 11 08:37:36 dcs dockerd[1982]: time="2024-03-11T08:37:36.390700752Z" level=info msg="Removing stale sandbox 350fcb34de29054d86e794608abbdd79d1a5c038d336738b5059b030b91796fa (a4f5bb74663e1e66a588c729b73db82304ccf4579b7e41b7e85f0e7b03abea97)"
Mar 11 08:37:36 dcs dockerd[1982]: time="2024-03-11T08:37:36.397288784Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 4b4dcd194907162f0f4998d95e3f4f1ed93fcf8acdbda82de70f0283d408bf44 f4cea004cad84161334419d1a69ab75a7405be0ce4cf690f0fd25c229800e32f], retrying…"
Mar 11 08:37:36 dcs dockerd[1982]: time="2024-03-11T08:37:36.411266800Z" level=info msg="Removing stale sandbox dccbb705d02ba845bb4b8ff91f1a66b559426f24c614abdf33a004e296dae0a7 (9e45e740d269275fc80e1efdadea8cb9bc1104badc4f4d1c63696b9d0cb108db)"
Mar 11 08:37:36 dcs dockerd[1982]: time="2024-03-11T08:37:36.418099664Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 4b4dcd194907162f0f4998d95e3f4f1ed93fcf8acdbda82de70f0283d408bf44 a312d2cafdfed415b82b3f54d4faaff5573b8f78c1c7a1a360074fe6d42bcd66], retrying…"
Mar 11 08:37:38 dcs dockerd[1982]: time="2024-03-11T08:37:38.678928976Z" level=info msg="Loading containers: done."
Mar 11 08:37:38 dcs dockerd[1982]: time="2024-03-11T08:37:38.739522224Z" level=info msg="Docker daemon" commit="24.0.5-0ubuntu1~20.04.1" graphdriver=overlay2 version=24.0.5
Mar 11 08:37:38 dcs dockerd[1982]: time="2024-03-11T08:37:38.740615216Z" level=info msg="Daemon has completed initialization"
Mar 11 08:37:38 dcs dockerd[1982]: time="2024-03-11T08:37:38.859371216Z" level=info msg="API listen on /run/docker.sock"
Mar 11 08:37:38 dcs systemd[1]: Started Docker Application Container Engine.

Can someone explain more about the solution? It feels like a huge mess with all the branches, repos and pull requests, and I can't work out where the solution is.

.

I tried the first solution I saw, { "bridge": "none" }, and now the daemon configuration looks like this:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "bridge": "none"
}

I also tried changing the "docker run" script, but it didn't work.
I can provide anything you need, just tell me what you want to see.

That's the docker run script:
#!/usr/bin/env bash

# set ros_domainId based on MAC address
mack_md5_hash=$(ip link show | awk '/ether/ {print $2}' | head -1 | md5sum | awk '{print $1}')
# Probably the domainId is over 232 or portBase is too high.
max_ros_domainId=229
ros_domainId=$((16#${mack_md5_hash} % ${max_ros_domainId}))
[[ ${ros_domainId} -lt 0 ]] && ros_domainId=$((${ros_domainId} + ${max_ros_domainId}))

# docker run command for mission-control container
docker run \
  --name name \
  --device=/dev/ttyTHS0:/dev/ttyTHS0 \
  --device=/dev/spidev0.0:/dev/spidev0.0 \
  --net=host \
  --ipc=host \
  --gpus all \
  --runtime nvidia \
  --publish-all \
  -v $(realpath $(dirname $(realpath "$0"))/…/…/):/workspace \
  -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
  -v /tmp/argus_socket:/tmp/argus_socket \
  --cap-add SYS_PTRACE \
  -e RMW_IMPLEMENTATION=rmw_fastrtps_cpp \
  -e ROS_DOMAIN_ID=${ros_domainId} \
  -e PYTHONOPTIMIZE=2 \
  -it \
  -d \
  --restart unless-stopped \
  image-name:latest \
  PATH

# docker run command for busybox container
for i in $(seq 1 4000); do
  docker run --rm ubuntu;
done

Hey again guys,
I need a solution as soon as you can provide one, I'm on a tight schedule at work.
Sorry for the pressure. If you can do something about it, I'd be glad.
Thanks!

Hi,

Sorry to keep you waiting.
We will discuss this in our internal meeting tomorrow then get back to you.

Thanks.


Hi

Could you test the steps below on the machine with the delay to see if they help?

1. Delete the file below

/var/lib/docker/network/files/local-kv.db

2. Restart

$ systemctl reload docker.service
$ systemctl restart docker.service
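
Deleting the file outright loses the previous network state for good. A cautious variant of step 1, sketched here, moves it aside instead so it can be restored later. `backup_network_db` is a hypothetical helper, and stopping Docker before touching the file is an assumption, not part of the steps above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move Docker's local network database aside
# instead of deleting it. Defaults to the path from step 1.
backup_network_db() {
  local db="${1:-/var/lib/docker/network/files/local-kv.db}"
  if [[ ! -f "$db" ]]; then
    echo "no file at $db" >&2
    return 1
  fi
  mv -- "$db" "${db}.bak"
}

# Assumed usage on the device, as root (not run here):
#   systemctl stop docker.service docker.socket
#   backup_network_db
#   systemctl start docker.service
```

If the change makes no difference, the original state can be put back by stopping Docker again and renaming `local-kv.db.bak` to `local-kv.db`.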

Thanks.

I tried this and unfortunately it didn't work.
I can provide a static IP address if you want to connect over SSH.
If you do, please give me an e-mail address.

Thanks!

Hi,

Thanks, but we will first check whether we can reproduce this internally.
