Hey,
First of all, thanks for the awesome work! Everything works great when I’m using docker :) But when I try to use containerd, I run into a couple of issues.
First, I get this error when I try to pull from the NGC registry:
crictl pull nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
FATA[0001] pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3": failed to resolve reference "nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3": no scope specified for token auth challenge
While docker pull works:
docker pull nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
r32.4.2-tf1.15-py3: Pulling from nvidia/l4t-tensorflow
Digest: sha256:ba57af516a1b0c021660bfe621af7f92e0fc3f17ba13a7e5a9c1c2a71355080b
Status: Image is up to date for nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
I rebuilt the images and pushed them to Docker Hub, https://hub.docker.com/repository/docker/povilasv/l4t-ml, and from there both crictl and docker can pull them fine, which probably means that your docker registry is not behaving well.
There is a similar issue on the containerd repo, https://github.com/containerd/containerd/issues/3556, where jfrog had to fix their docker registry: https://www.jfrog.com/jira/browse/RTFACT-20170.
The second question is: how do I get containerd to work with the GPU? I tried changing the runtime in /etc/containerd/config.toml from runc to nvidia-container-runtime:
[plugins."io.containerd.runtime.v1.linux"]
#shim = "containerd-shim"
#runtime = "runc"
runtime = "/usr/bin/nvidia-container-runtime"
#runtime_root = ""
#no_shim = false
#shim_debug = false
Containerd version:
containerd --version
containerd github.com/containerd/containerd 1.3.3-0ubuntu1~18.04.2
Full containerd config file:
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
[grpc]
address = "/run/containerd/containerd.sock"
tcp_address = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
[ttrpc]
address = ""
uid = 0
gid = 0
[debug]
address = ""
uid = 0
gid = 0
level = ""
[metrics]
address = ""
grpc_histogram = false
[cgroup]
path = ""
[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"
[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"
stream_idle_timeout = "4h0m0s"
enable_selinux = false
sandbox_image = "k8s.gcr.io/pause:3.1"
stats_collect_period = 10
systemd_cgroup = false
enable_tls_streaming = false
max_container_log_line_size = 16384
disable_cgroup = false
disable_apparmor = false
restrict_oom_score_adj = false
max_concurrent_downloads = 3
disable_proc_mount = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
no_pivot = false
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
runtime_type = ""
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v1"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-1.docker.io"]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
#shim = "containerd-shim"
#runtime = "runc"
runtime = "/usr/bin/nvidia-container-runtime"
#runtime_root = ""
#no_shim = false
#shim_debug = false
[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/arm64/v8"]
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.devmapper"]
root_path = ""
pool_name = ""
base_image_size = ""
This doesn’t work: the containers fail with LD "library not found" errors.
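One possible explanation, offered as an untested guess: in the config above the CRI plugin starts containers through the runc entry with runtime_type = "io.containerd.runc.v1", so the runtime set under [plugins."io.containerd.runtime.v1.linux"] may never be used for containers started via CRI. A minimal sketch of registering the NVIDIA runtime in the CRI section instead is below. The runtime name "nvidia" is made up for illustration, and it is an assumption that the BinaryName option is honoured by this containerd version (1.3.3):

```toml
# Untested sketch: merge into the existing CRI section of /etc/containerd/config.toml.
[plugins."io.containerd.grpc.v1.cri".containerd]
  # "nvidia" is a hypothetical runtime name; it only has to match the table below.
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    # Same shim type the existing runc entry uses.
    runtime_type = "io.containerd.runc.v1"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      # Assumption: the shim accepts BinaryName and execs this instead of runc.
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

containerd would need a restart after the change for it to take effect.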
Thank you!