Docker exec fails on xavier

I am getting the following error message when running the ‘docker exec’ command on a Drive AGX Xavier:

$ docker run --detach --rm --name ubuntu -it ubuntu bash
1c81ac50d0f52a915ecb08bb56c73ab14e57045acea03ee5247d5b65cd8af78a
$ docker exec -it ubuntu bash
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "close exec fds: open /proc/self/fd: no such file or directory": unknown

Others have reported this error online as well. Is there a solution for it yet? Is docker supported on Drive AGX?

I am using SDK version: 1.0.0.5517 and Drive Software 10.0


Dear anurag08upx,
We are checking on this internally and will get back to you.

I’m having the same issue here.

I don’t know the exact SDK and Drive Software versions; it is probably version 9.0. Hint: the stock TensorRT version is 5.0.3.

I’ve tested docker-ce 18.09.7 and 19.03.6; both produce exactly the same error message as the one posted above.

Any update?

Hi, I’ve also opened an issue on github/docker-for-linux: https://github.com/docker/for-linux/issues/939

Hopefully it’ll be solved soon.

Hi All,

Have you installed Docker following the cross-compilation guide for ARM? https://www.docker.com/blog/getting-started-with-docker-for-arm-on-linux/

Can you share the usecase(s) you’d like to use Docker for on the AGX? It may help to understand the priority of supporting docker more formally.

Hi LukeNV

I am not doing any cross-compilation for ARM. All our Docker images for the AGX are created and run on an ARM machine.

Docker is part of our deployment strategy. Every new build of our application is packaged as a docker image and deployed/run on the AGX.
A deployment script fetches the latest docker image on to the AGX and starts it with the following command:

docker run --name

An engineer can, if and when needed, log into the AGX and access the container with the following command:

docker exec -it bash

We need to access the container running on AGX for various development/debugging purposes.

We debugged the issue further and think that the most likely cause is the custom NVIDIA Linux kernel.

Some more detailed debug logs:

Run the stock nginx container:

`$ docker run -it --rm -d -p "8080:80" --name nginx nginx`

Ensure it's working:

`$ curl localhost:8080`

Returns the default nginx "successfully installed" html.

Attempt to exec into the nginx container:

`$ docker exec -it nginx bash`

```shell
$ docker -l "debug" exec -it nginx bash
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "close exec fds: open /proc/self/fd: no such file or directory": unknown
DEBU[0000] [hijack] End of stdout
DEBU[0000] Error resize: Error response from daemon: no such exec
```

dockerd logs

Feb 26 12:32:40 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:40.339431850-08:00" level=debug msg="Calling HEAD /_ping"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.566852964-08:00" level=debug msg="Calling HEAD /_ping"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.568211972-08:00" level=debug msg="Calling GET /v1.40/containers/nginx/json"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.573138148-08:00" level=debug msg="Calling POST /v1.40/containers/nginx/exec"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.573408964-08:00" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"bash\"],\"Detach\":false,\"DetachKeys\":\"\",\"Env\":null,\"Privileged\":false,\"Tty\":true,\"User\":\"\",\"WorkingDir\":\"\"}"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.574936612-08:00" level=debug msg="Calling POST /v1.40/exec/425c922e3eae3a04332fbcc776b4786098816a02707d4b6b24f1ef92cc22217c/start"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.575067524-08:00" level=debug msg="form data: {\"Detach\":false,\"Tty\":true}"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.575322148-08:00" level=debug msg="starting exec command 425c922e3eae3a04332fbcc776b4786098816a02707d4b6b24f1ef92cc22217c in container 8eac8b63764e78dbff555cee66eaa7f0b6869e82c0ded5db5b4f4972daee0f06"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.576859684-08:00" level=debug msg="Calling POST /v1.40/exec/425c922e3eae3a04332fbcc776b4786098816a02707d4b6b24f1ef92cc22217c/resize?h=32&w=71"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.580853956-08:00" level=debug msg="attach: stdin: begin"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.580950020-08:00" level=debug msg="attach: stdout: begin"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.917705890-08:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.919902562-08:00" level=debug msg="attach: stdout: end"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.919985602-08:00" level=debug msg="attach: stdin: end"
Feb 26 12:32:46 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:46.920070722-08:00" level=debug msg="attach done"
Feb 26 12:32:47 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:47.014723170-08:00" level=error msg="Error running exec 425c922e3eae3a04332fbcc776b4786098816a02707d4b6b24f1ef92cc22217c in container: OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"close exec fds: open /proc/self/fd: no such file or directory\": unknown"
Feb 26 12:32:47 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:47.014903714-08:00" level=debug msg="Closing buffered stdin pipe"
Feb 26 12:32:47 tegra-ubuntu dockerd[5726]: time="2020-02-26T12:32:47.055543042-08:00" level=debug msg="Calling GET /v1.40/exec/425c922e3eae3a04332fbcc776b4786098816a02707d4b6b24f1ef92cc22217c/json"

System Info

Kernel Version

$ uname -a
Linux tegra-ubuntu 4.14.102-rt53-tegra #1 SMP PREEMPT RT Fri Sep 20 16:23:45 PDT 2019 aarch64 aarch64 aarch64 GNU/Linux


Docker Info

$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 47
 Server Version: 19.03.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
 runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.102-rt53-tegra
 Operating System: Ubuntu 18.04.2 LTS (containerized)
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 27.27GiB
 Name: tegra-ubuntu
 ID: AC5C:GETI:B62Q:EXXO:6P2X:HJPF:ZCJN:BYWI:V5Y7:JKRJ:5GPZ:O2Z2
 Docker Root Dir: /ota/pkg_data/docker
 Debug Mode: true
  File Descriptors: 33
  Goroutines: 43
  System Time: 2020-02-26T12:36:04.217989755-08:00
  EventsListeners: 0
 Username: voyagebot
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support

Hello, I’m facing a similar issue on Drive Software 10.0 (DriveOS 5.1.6). My error is slightly different with Docker 19.03.8 and runc 1.0.0-rc10.

> docker exec -ti xxx /bin/bash
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer": unknown

But if I downgrade to an older Docker (e.g. 18.06.3), I see exactly the same error: “close exec fds: open /proc/self/fd: no such file or directory”.

I also checked the kernel config with the official “check-config.sh” as described here, and it looks fine (all “Generally Necessary” options are enabled).
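For anyone who wants to spot-check individual options without the full script, here is a minimal sketch. It assumes the kernel exposes its configuration at /proc/config.gz (only true when CONFIG_IKCONFIG_PROC is set), and `kconfig_state` is a hypothetical helper name, not part of Docker or check-config.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the state of one kernel option.
# Reads a kernel config (the text inside /proc/config.gz) from standard input.
kconfig_state() {
    local option="$1" line
    # Match only an exact "CONFIG_FOO=value" line.
    # "# CONFIG_FOO is not set" does not match and is reported as missing.
    line="$(grep -E "^${option}=" | head -n 1 || true)"
    case "${line#*=}" in
        y)  echo "enabled" ;;
        m)  echo "module" ;;
        "") echo "missing" ;;
        *)  echo "value:${line#*=}" ;;
    esac
}

# Usage on the target (assumption: /proc/config.gz exists):
#   for opt in CONFIG_NAMESPACES CONFIG_CGROUPS CONFIG_OVERLAY_FS; do
#       printf '%s: %s\n' "$opt" "$(zcat /proc/config.gz | kconfig_state "$opt")"
#   done
```

The three option names in the usage comment are only examples; check-config.sh covers a much longer list.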

I assume the error may be related to some specific changes that NVIDIA makes to the kernel.

After digging a bit deeper, I found some more details. I ran a bare runc container as described here. When I exec into the container, I get:

# runc exec mycontainerid ps 
panic: cannot statfs cgroup root

goroutine 1 [running, locked to thread]:
github.com/opencontainers/runc/libcontainer/cgroups.IsCgroup2UnifiedMode.func1()
        /go/src/github.com/opencontainers/runc/libcontainer/cgroups/utils.go:45 +0xa8
sync.(*Once).Do(0x55652cbb20, 0x5565000328)
        /usr/local/go/src/sync/once.go:44 +0xc4
github.com/opencontainers/runc/libcontainer/cgroups.IsCgroup2UnifiedMode(0x20)
        /go/src/github.com/opencontainers/runc/libcontainer/cgroups/utils.go:42 +0x38
github.com/opencontainers/runc/libcontainer.Cgroupfs(0x40000c4240, 0x400000e560, 0x150)
        /go/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:92 +0x20
github.com/opencontainers/runc/libcontainer.New(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5, 0x5564e12e5d, 0x40000b56d8)
        /go/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:176 +0x16c
main.glob..func6(0x40000c8840, 0x5564fdaa00, 0x40000b56d8)
        /go/src/github.com/opencontainers/runc/init.go:42 +0x2c
github.com/opencontainers/runc/vendor/github.com/urfave/cli.HandleAction(0x5564f83620, 0x5565000958, 0x40000c8840, 0x400005a300, 0x0)
        /go/src/github.com/opencontainers/runc/vendor/github.com/urfave/cli/app.go:490 +0xd0
github.com/opencontainers/runc/vendor/github.com/urfave/cli.Command.Run(0x5564e12afc, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5564e27d77, 0x51, 0x0, ...)
        /go/src/github.com/opencontainers/runc/vendor/github.com/urfave/cli/command.go:210 +0x70c
github.com/opencontainers/runc/vendor/github.com/urfave/cli.(*App).Run(0x4000102340, 0x4000088020, 0x2, 0x2, 0x0, 0x0)
        /go/src/github.com/opencontainers/runc/vendor/github.com/urfave/cli/app.go:255 +0x4f4
main.main()
        /go/src/github.com/opencontainers/runc/main.go:145 +0x948
ERRO[0000] exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer" 
exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer"

According to the code, this fails while executing statfs on /sys/fs/cgroup. I also see the following statfs call in strace:

statfs("/sys/fs/cgroup", {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=3574287, f_bfree=3574287, f_bavail=3574287, f_files=3574287, f_ffree=3574270, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RDONLY|ST_NOSUID|ST_NODEV|ST_NOEXEC}) = 0
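The same filesystem-type check can be repeated from a shell, which may help when comparing the host with other environments. This is only a sketch: `cgroup_mode` is a hypothetical helper, and it relies on GNU coreutils `stat -f -c %T`, which prints the filesystem type name (tmpfs corresponds to the TMPFS_MAGIC value in the strace output above).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the filesystem type reported for the cgroup
# mount point to the cgroup hierarchy it implies.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified (cgroup v2)" ;;
        tmpfs)     echo "legacy or hybrid (cgroup v1)" ;;
        *)         echo "unexpected: $1" ;;
    esac
}

# Usage (GNU coreutils stat; %T prints the filesystem type name):
#   cgroup_mode "$(stat -f -c %T /sys/fs/cgroup)"
```

If `stat` itself fails with "No such file or directory" in some environment, that would be consistent with the "cannot statfs cgroup root" panic, since the path is then not reachable there at all.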

These source files can be helpful:

https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/nsexec.c
https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/utils.go

This is what I have for now.


I think I have a workaround for docker exec.

The issue seems to be related to the fact that, when you enter the container’s mount namespace (sudo nsenter -t PID -m ls -l /), the container’s actual rootfs is not / but /new_root. I see /new_root only on NVIDIA Drive. As was pointed out in this comment from 2017, the workaround is to chroot into /new_root.

So the final version that works for me (add it to .bashrc):

docker-exec () { sudo nsenter --target "$(docker inspect --format '{{.State.Pid}}' "$1")" -a /usr/sbin/chroot /new_root "${@:2}"; }

and then

docker-exec NAME bash
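A slightly stricter variant of the wrapper above could refuse to run when the container is not running, instead of handing nsenter a PID of 0. This is a sketch under the same assumptions as the original one-liner (the container rootfs is at /new_root and chroot lives at /usr/sbin/chroot); `container_pid` and `docker_exec_new_root` are hypothetical names.

```shell
#!/usr/bin/env bash
# Print the host PID of a running container, or fail if it is not running.
container_pid() {
    local pid
    pid="$(docker inspect --format '{{.State.Pid}}' "$1")" || return 1
    # Docker reports PID 0 for a container that is not running.
    [ "${pid:-0}" -gt 0 ] || return 1
    echo "$pid"
}

# Enter all namespaces of the container, then chroot into /new_root
# (assumption: the rootfs location described in this thread).
docker_exec_new_root() {
    local name="$1" pid
    shift
    pid="$(container_pid "$name")" || {
        echo "container '$name' is not running" >&2
        return 1
    }
    sudo nsenter --target "$pid" --all /usr/sbin/chroot /new_root "$@"
}
```

It is called the same way, for example `docker_exec_new_root NAME bash`.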

Why does /new_root exist only on NVIDIA?
I’m not exactly sure, but I assume it may be related to differences in the pivot_root system call. I am far from certain, though; I see that runc has multiple ways of preparing the rootfs.


@anurag08upx, our DevOps engineer is looking for a way to create a Docker image/container with GPU support on Drive AGX. Is your Docker image supposed to have GPU support?
Thanks!

Hi @shayan.manoochehri

Thanks for looking into the issue. The problem is not with creating a docker image/container.
The problem is with running the “docker exec” command (to connect to an already running container)

There is a workaround, as others have posted on the thread, but it would be great if Nvidia can look into the root cause and solve it in the kernel.

Thanks

FWIW the workaround proposed by roman48tdr is working on my AGX.
