I came across some odd behaviour when launching a new task via ECS to start a GPU Docker container on an ECS instance. The setup worked a few weeks ago, and I am not sure whether something has changed since then.
I can successfully run the container manually with '--runtime=nvidia', and nvidia-smi returns the correct output, so it seems that all drivers are correctly installed.
However, when the same command is triggered by ECS, it returns the following:
level=info time=2023-01-13T12:44:09Z msg="Sending state change to ECS" eventType="task" eventData="TaskChange: [arn:aws:ecs:us-east-2:346811575828:task/main01/93928a60ec9348f9a5cfd637a31ab7df → STOPPED, Known Sent: NONE, PullStartedAt: 0001-01-01 00:00:00 +0000 UTC, PullStoppedAt: 0001-01-01 00:00:00 +0000 UTC, ExecutionStoppedAt: 2023-01-13 12:44:09.715590988 +0000 UTC m=+5917.386641330, container change: arn:aws:ecs:us-east-2:346811575828:task/main01/93928a60ec9348f9a5cfd637a31ab7df main → STOPPED, Reason CannotStartContainerError: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: device error: GPU-138bd48d-9d6e-5b3b-494e-eff9d979e4df: unknown device: unknown, Known Sent: NONE] sent: false"
My /etc/ecs/ecs.config has not changed and it is still:
ECS_CLUSTER=main01
ECS_ENABLE_GPU_SUPPORT=true
ECS_NVIDIA_RUNTIME=nvidia
ECS_ENABLE_GPU=true
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
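
Since the error names a specific GPU UUID as an "unknown device", one check that might help narrow this down is whether that UUID matches any GPU the host currently reports. Below is a minimal sketch of that comparison, not a confirmed diagnosis. The helper names (uuids_in_log, missing_uuids, host_gpu_uuids) are made up for illustration, and it assumes nvidia-smi is on the PATH of the ECS instance.

```python
import re
import subprocess

# GPU UUIDs as they appear in the log above: "GPU-" followed by 8-4-4-4-12 hex digits.
UUID_RE = re.compile(r"GPU-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def uuids_in_log(text):
    """Return the GPU UUIDs mentioned in a log line, in order, without duplicates."""
    seen = []
    for uuid in UUID_RE.findall(text):
        if uuid not in seen:
            seen.append(uuid)
    return seen


def missing_uuids(requested, host_uuids):
    """Return the requested UUIDs that the host does not report."""
    present = set(host_uuids)
    return [uuid for uuid in requested if uuid not in present]


def host_gpu_uuids():
    """Ask nvidia-smi for the UUIDs of the GPUs visible on this host.

    Assumption: nvidia-smi is installed and on the PATH.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On the instance, calling missing_uuids(uuids_in_log(log_line), host_gpu_uuids()) with the agent log line above would list any UUID the task asked for that the host does not report. A non-empty result would suggest the UUID passed to the container is stale rather than a driver problem, but that is only a hypothesis at this point.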