Issues with Machine Learning Containers for Jetson on containerd

Hey,

First of all, thanks for the awesome work! Everything works great when I'm using Docker :) But when I try using containerd, I run into a couple of issues.

First, I get this error when I try to pull from the NGC registry:

crictl pull nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
FATA[0001] pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3": failed to resolve reference "nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3": no scope specified for token auth challenge 

While docker pull works:

docker pull nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
r32.4.2-tf1.15-py3: Pulling from nvidia/l4t-tensorflow
Digest: sha256:ba57af516a1b0c021660bfe621af7f92e0fc3f17ba13a7e5a9c1c2a71355080b
Status: Image is up to date for nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3

I rebuilt the images and pushed them to Docker Hub (https://hub.docker.com/repository/docker/povilasv/l4t-ml), and from there they pull fine with both crictl and docker. This probably means that your Docker registry is not behaving well.
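
For example (substitute a tag from that repository):

crictl pull docker.io/povilasv/l4t-ml:<tag>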

There is a similar issue on the containerd repo (https://github.com/containerd/containerd/issues/3556), where JFrog had to fix their Docker registry: https://www.jfrog.com/jira/browse/RTFACT-20170.
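
A quick way to inspect the token auth challenge a registry returns (the part containerd is complaining about):

curl -sI https://nvcr.io/v2/
# expect an HTTP 401 with a "WWW-Authenticate: Bearer ..." header carrying realm= and service=;
# the containerd error above means it could not find or derive a scope from this challenge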

The second question is: how do I get containerd to work with the GPU? I tried changing the runtime in /etc/containerd/config.toml from runc to nvidia-container-runtime:

  [plugins."io.containerd.runtime.v1.linux"]
    #shim = "containerd-shim"
    #runtime = "runc"
    runtime = "/usr/bin/nvidia-container-runtime"
    #runtime_root = ""
    #no_shim = false
    #shim_debug = false
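
(After editing, I restarted containerd so the change takes effect:)

sudo systemctl restart containerd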

Containerd version:

containerd --version
containerd github.com/containerd/containerd 1.3.3-0ubuntu1~18.04.2 

Full containerd config file:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/pause:3.1"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "runc"
      no_pivot = false
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
  [plugins."io.containerd.runtime.v1.linux"]
    #shim = "containerd-shim"
    #runtime = "runc"
    runtime = "/usr/bin/nvidia-container-runtime"
    #runtime_root = ""
    #no_shim = false
    #shim_debug = false
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/arm64/v8"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = ""
    pool_name = ""
    base_image_size = ""

This doesn't work, and the containers fail with LD library-not-found errors.

Thank you!

Hi,

May I first ask which JetPack version you are using? v4.4 or v4.3?
Different containers are available for different JetPack versions.

Thanks.

4.4. I did a fresh install about 3 days ago and updated everything to the latest.

sudo jetson_release 
 - NVIDIA Jetson Nano (Developer Kit Version)
   * Jetpack 4.4 DP [L4T 32.4.2]
   * NV Power Mode: MAXN - Type: 0
   * jetson_clocks service: inactive
 - Libraries:
   * CUDA: 10.2.89
   * cuDNN: 8.0.0.145
   * TensorRT: 7.1.0.16
   * Visionworks: 1.6.0.501
   * OpenCV: 4.1.1 compiled CUDA: NO
ssh jetson-0
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.9.140 aarch64)

Hi,

1. Please correct me if anything is missing.
It looks like the crictl issue is caused by a different server storing the images.
Usually, we push the images to our NGC cloud here:
https://ngc.nvidia.com/catalog/containers?orderBy=modifiedDESC&pageNumber=0&query=l4t&quickFilter=containers&filters=

2.
GPU access can be enabled like this:

$ sudo docker run ... --runtime nvidia ...
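
For example, with the image above:

$ sudo docker run -it --rm --runtime nvidia nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3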

Thanks.

Re 1. What do you mean?

docker pull works:

docker pull nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3

containerd pull doesn’t:

crictl pull nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3
FATA[0001] pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3": failed to resolve reference "nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3": no scope specified for token auth challenge 

Re 2. You are using the docker command here; the question is about containerd without Docker.

Hi,

1.
crictl is not supported. Please use docker pull for this.

2.
Some CUDA-related libraries are mounted from the host at runtime.
To access these files, please COPY them into the image directly.
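
For example, a rough sketch (paths are illustrative, and the libraries must first be placed inside the Docker build context):

FROM nvcr.io/nvidia/l4t-base:r32.4.2
# Illustrative only: assumes the CUDA libraries were copied from the device's
# /usr/local/cuda-10.2/lib64 into ./cuda/lib64 next to this Dockerfile
COPY cuda/lib64/ /usr/local/cuda-10.2/lib64/
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:${LD_LIBRARY_PATH}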

Thanks.

Thanks for the info. I expected containerd to work, since other NVIDIA runtime versions have it documented: https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html#kubernetes-containerd.
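
For reference, the usual way to register an alternative runtime with containerd's CRI plugin is under its runtimes table rather than replacing the v1 linux shim; roughly like this (my adaptation, untested on Jetson, so adjust for your containerd version):

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v1"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"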

I will use docker…