Failing to start GPU docker via ECS

pietro3 · January 13, 2023, 1:07pm

Hello,

I came across some odd behaviour when launching a new ECS task to start a GPU Docker container on an ECS instance. The setup worked a few weeks ago, and I am not sure whether something has changed since then.

I can successfully run the container manually with '--runtime=nvidia', and nvidia-smi returns the correct output, so the drivers appear to be correctly installed.
However, when the same command is triggered by ECS, it returns the following:

level=info time=2023-01-13T12:44:09Z msg="Sending state change to ECS" eventType="task" eventData="TaskChange: [arn:aws:ecs:us-east-2:346811575828:task/main01/93928a60ec9348f9a5cfd637a31ab7df -> STOPPED, Known Sent: NONE, PullStartedAt: 0001-01-01 00:00:00 +0000 UTC, PullStoppedAt: 0001-01-01 00:00:00 +0000 UTC, ExecutionStoppedAt: 2023-01-13 12:44:09.715590988 +0000 UTC m=+5917.386641330, container change: arn:aws:ecs:us-east-2:346811575828:task/main01/93928a60ec9348f9a5cfd637a31ab7df main -> STOPPED, Reason CannotStartContainerError: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: device error: GPU-138bd48d-9d6e-5b3b-494e-eff9d979e4df: unknown device: unknown, Known Sent: NONE] sent: false"
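
The error names a specific GPU UUID as an "unknown device". One possible explanation, which is an assumption and not something the log confirms, is that the UUID requested for the container no longer matches a GPU the driver currently reports. A minimal sketch for checking this is below; the helper names (`uuids_in_error`, `uuids_on_host`, `missing_uuids`) are hypothetical, and it assumes `nvidia-smi` is on the PATH of the instance.

```python
import re
import subprocess

# GPU UUIDs as they appear in the error above: "GPU-" followed by 8-4-4-4-12 hex digits.
GPU_UUID_RE = re.compile(r"GPU-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def uuids_in_error(message):
    """Return every GPU UUID mentioned in an error message, in order of appearance."""
    return GPU_UUID_RE.findall(message)


def uuids_on_host():
    """Ask nvidia-smi for the UUIDs of the GPUs the driver currently reports."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def missing_uuids(message, host_uuids):
    """Return UUIDs named in the error that are not in the host's list."""
    known = set(host_uuids)
    return [uuid for uuid in uuids_in_error(message) if uuid not in known]
```

On the instance, calling `missing_uuids(error_text, uuids_on_host())` with the stderr text from the log would return a non-empty list if the UUID in the error is not among the GPUs the driver reports, and an empty list if it is, in which case the cause lies elsewhere.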

My /etc/ecs/ecs.config has not changed and it is still:

ECS_CLUSTER=main01
ECS_ENABLE_GPU_SUPPORT=true
ECS_NVIDIA_RUNTIME=nvidia
ECS_ENABLE_GPU=true
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
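
To confirm that the file really contains what is listed above (for example, no duplicated keys or stray whitespace), a small sketch like the following could be used. The function names are hypothetical, and it assumes the file is plain `KEY=value` lines with optional `#` comments.

```python
from collections import Counter


def parse_env_file(text):
    """Parse KEY=value lines into (key, value) pairs, skipping blanks and comments."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


def duplicate_keys(pairs):
    """Return the sorted list of keys that appear more than once."""
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, count in counts.items() if count > 1)
```

For example, `parse_env_file(open("/etc/ecs/ecs.config").read())` yields the pairs as they are written in the file, and `duplicate_keys` on that result flags any key that is set twice.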
