Jetson Orin Nano Dev Board Pods Stuck in ContainerCreating State

I have a Jetson Orin Nano and am trying to run k3s on it. However, none of the pods ever get past ContainerCreating:

tyler@orin-nano-01:~$ curl -sfL https://get.k3s.io | sh -s - --docker --write-kubeconfig-mode 644 --write-kubeconfig $HOME/.kube/config
[INFO]  Finding release for channel stable
[INFO]  Using v1.29.6+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s1/sha256sum-arm64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s1/k3s-arm64
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s
tyler@orin-nano-01:~$ kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
orin-nano-01   Ready    control-plane,master   17s   v1.29.6+k3s1
tyler@orin-nano-01:~$ kubectl get pods -A
NAMESPACE     NAME                                     READY   STATUS              RESTARTS   AGE
kube-system   coredns-6799fbcd5-f9mgq                  0/1     ContainerCreating   0          7s
kube-system   helm-install-traefik-5892k               0/1     ContainerCreating   0          8s
kube-system   helm-install-traefik-crd-xlkb2           0/1     ContainerCreating   0          8s
kube-system   local-path-provisioner-6f5d79df6-5bjpw   0/1     ContainerCreating   0          7s
kube-system   metrics-server-54fd9b65b-2szjw           0/1     ContainerCreating   0          7s
tyler@orin-nano-01:~$ kubectl describe pod coredns-6799fbcd5-f9mgq -n kube-system
Name:                 coredns-6799fbcd5-f9mgq
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 orin-nano-01/192.168.1.230
Start Time:           Fri, 05 Jul 2024 15:28:04 -0500
Labels:               k8s-app=kube-dns
                      pod-template-hash=6799fbcd5
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-6799fbcd5
Containers:
  coredns:
    Container ID:  
    Image:         rancher/mirrored-coredns-coredns:1.10.1
    Image ID:      
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=2s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /etc/coredns/custom from custom-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vdbv4 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  custom-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns-custom
    Optional:  true
  kube-api-access-vdbv4:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                              node-role.kubernetes.io/master:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
  Type     Reason                  Age              From               Message
  ----     ------                  ----             ----               -------
  Normal   Scheduled               19s              default-scheduler  Successfully assigned kube-system/coredns-6799fbcd5-f9mgq to orin-nano-01
  Warning  FailedCreatePodSandBox  15s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "coredns-6799fbcd5-f9mgq": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod37077870_be63_4b9e_9bfa_872ce17336ca.slice/docker-6f56fb05536e2283dd3f00dc1a83ec36b0d7d5f1bde2ccd22643b87bcc0146ed.scope/cpu.weight: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  5s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "coredns-6799fbcd5-f9mgq": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod37077870_be63_4b9e_9bfa_872ce17336ca.slice/docker-a2c97b1da3c071dc360217b408e7fc0cd11fb1b8afd2b4e9c67283ddf1f5d083.scope/cpu.weight: no such file or directory: unknown
  Normal   SandboxChanged          4s (x2 over 9s)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  1s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "coredns-6799fbcd5-f9mgq": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod37077870_be63_4b9e_9bfa_872ce17336ca.slice/docker-497c9db51f20eb47b69e87fa46232f891b1747c6b403106989de0116c960ed71.scope/cpu.weight: no such file or directory: unknown

Some digging online pointed to cgroup-related problems, but my tinkering with cgroup settings hasn’t fixed anything.
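The file named in the error, cpu.weight, is an interface file of the cgroup v2 cpu controller, so one thing worth checking is whether the running kernel exposes that controller at all. The following is a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; the has_controller helper name is hypothetical and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# has_controller FILE NAME
# Succeeds if NAME appears as a whole word in a cgroup.controllers-style file
# (a single line of space-separated controller names).
# Returns 2 if the file cannot be read, 1 if the controller is not listed.
has_controller() {
    file=$1
    name=$2
    [ -r "$file" ] || return 2
    for c in $(cat "$file"); do
        if [ "$c" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Assumption: cgroup v2 unified hierarchy mounted at /sys/fs/cgroup.
root=/sys/fs/cgroup
if [ -r "$root/cgroup.controllers" ]; then
    for ctrl in cpu memory pids; do
        if has_controller "$root/cgroup.controllers" "$ctrl"; then
            echo "$ctrl: available"
        else
            echo "$ctrl: MISSING"
        fi
    done
else
    echo "no $root/cgroup.controllers (cgroup v1, or cgroup2 not mounted here?)"
fi
```

If cpu is reported as missing, runc has no cpu.weight or cpu.max file to write to, which would be consistent with the "no such file or directory" errors above.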

Some background:

  • I’m running a custom kernel to enable the iSCSI TCP module. I followed this guide and enabled CONFIG_ISCSI_TCP=m and CONFIG_SCSI_ISCSI_ATTRS=m (to eventually support Longhorn pods). I’ve also enabled CONFIG_FAIR_GROUP_SCHED=y and CONFIG_RT_GROUP_SCHED=y in an attempt to fix this issue (to no avail). Everything else should be standard.

    • When I was running the “standard” kernel, k3s was able to create and run the pods.
  • I’m booting directly from an SSD, following this quick start guide.

  • I’ve updated and upgraded packages with sudo apt update && sudo apt upgrade.

  • I’m running the latest version of Jetpack:

    tyler@orin-nano-01:~$ apt list --installed | grep nvidia-jetpack
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    nvidia-jetpack-dev/stable,now 6.0+b106 arm64 [installed,automatic]
    nvidia-jetpack-runtime/stable,now 6.0+b106 arm64 [installed,automatic]
    nvidia-jetpack/stable,now 6.0+b106 arm64 [installed]
    
  • Other machine info:

    tyler@orin-nano-01:~$ uname -a
    Linux orin-nano-01 5.15.136-rt-tegra #5 SMP PREEMPT_RT Fri Jul 5 13:52:58 CDT 2024 aarch64 aarch64 aarch64 GNU/Linux
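Because the custom kernel is the main difference from the working setup, it can help to confirm what the running kernel was actually built with rather than what was intended. The sketch below assumes the kernel exposes its configuration at /proc/config.gz (this requires CONFIG_IKCONFIG_PROC, which may not be enabled on every build); the config_state helper is a hypothetical name, and the option list is only a suggestion of scheduler and iSCSI options relevant to this thread.

```shell
#!/bin/sh
# config_state OPTION
# Reads a kernel config on stdin and prints the value of OPTION
# ("y", "m", or a literal value). Prints "not set" when the option is
# absent or only appears as a "# CONFIG_X is not set" comment.
config_state() {
    opt=$1
    val=$(sed -n "s/^${opt}=//p" | head -n 1)
    if [ -n "$val" ]; then
        echo "$val"
    else
        echo "not set"
    fi
}

# Assumption: /proc/config.gz exists; otherwise point this at the .config
# file that was used for the kernel build.
if [ -r /proc/config.gz ]; then
    for opt in CONFIG_CGROUP_SCHED CONFIG_FAIR_GROUP_SCHED CONFIG_CFS_BANDWIDTH \
               CONFIG_RT_GROUP_SCHED CONFIG_PREEMPT_RT \
               CONFIG_ISCSI_TCP CONFIG_SCSI_ISCSI_ATTRS; do
        printf '%s=%s\n' "$opt" "$(zcat /proc/config.gz | config_state "$opt")"
    done
else
    echo "/proc/config.gz not readable; check the .config used for the build"
fi
```

Comparing this output between the stock kernel and the custom one should show which scheduler options actually differ.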
    

Can anyone provide guidance on what I might be missing or what additional steps I should take to fix this?

Hi,

There is a known issue in nvidia-container.
Could you check the comment below and see whether it helps with your issue as well?

Thanks.

I ran sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml --mode=csv and now get this output:

tyler@orin-nano-01:~$ nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=all

However the same issue persists for all pods:

  Warning  Failed            25s (x2 over 47s)      kubelet            Error: failed to start container "coredns": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2f270a53_b746_4fef_8e7c_0030e1910a27.slice/docker-coredns.scope/cpu.max: no such file or directory: unknown

This is my Docker daemon config for reference:

tyler@orin-nano-01:~$ sudo cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

I’m not sure whether this is a red herring, but my attempt to verify that CDI devices can be used from a Docker container fails:

tyler@orin-nano-01:~$ docker run --rm -ti --runtime=nvidia nvcr.io/nvidia/k8s/cuda-sample:devicequery-cuda12.5.0-ubuntu22.04
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to create NVIDIA Container Runtime: failed to construct OCI spec modifier: requirements not met: cuda>=12.5||brand=unknown&&driver>=470&&driver<471||brand=grid&&driver>=470&&driver<471||brand=tesla&&driver>=470&&driver<471||brand=nvidia&&driver>=470&&driver<471||brand=quadro&&driver>=470&&driver<471||brand=quadrortx&&driver>=470&&driver<471||brand=nvidiartx&&driver>=470&&driver<471||brand=vapps&&driver>=470&&driver<471||brand=vpc&&driver>=470&&driver<471||brand=vcs&&driver>=470&&driver<471||brand=vws&&driver>=470&&driver<471||brand=cloudgaming&&driver>=470&&driver<471||brand=unknown&&driver>=535&&driver<536||brand=grid&&driver>=535&&driver<536||brand=tesla&&driver>=535&&driver<536||brand=nvidia&&driver>=535&&driver<536||brand=quadro&&driver>=535&&driver<536||brand=quadrortx&&driver>=535&&driver<536||brand=nvidiartx&&driver>=535&&driver<536||brand=vapps&&driver>=535&&driver<536||brand=vpc&&driver>=535&&driver<536||brand=vcs&&driver>=535&&driver<536||brand=vws&&driver>=535&&driver<536||brand=cloudgaming&&driver>=535&&driver<536||brand=unknown&&driver>=550&&driver<551||brand=grid&&driver>=550&&driver<551||brand=tesla&&driver>=550&&driver<551||brand=nvidia&&driver>=550&&driver<551||brand=quadro&&driver>=550&&driver<551||brand=quadrortx&&driver>=550&&driver<551||brand=nvidiartx&&driver>=550&&driver<551||brand=vapps&&driver>=550&&driver<551||brand=vpc&&driver>=550&&driver<551||brand=vcs&&driver>=550&&driver<551||brand=vws&&driver>=550&&driver<551||brand=cloudgaming&&driver>=550&&driver<551 not met: unknown.

The output of nvidia-smi for reference:

tyler@orin-nano-01:~$ nvidia-smi
Mon Jul  8 09:43:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.3.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

So I installed CUDA 12.5 seemingly successfully, but nvidia-smi’s output remains the same:

tyler@orin-nano-01:~$ nvidia-smi
Mon Jul  8 10:13:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.3.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Again, not sure if this piece of info is distracting or not.

Hi,

The container you tested is built for desktop environments.

For Orin, please try l4t-cuda below instead:

Thanks.

Ah, after running docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-cuda:12.2.12-devel /bin/bash and compiling deviceQuery from the cuda-samples repo (tag v12.2) inside the container, these are the results:

root@orin-nano-01:/cuda-samples-12.2/Samples/1_Utilities/deviceQuery# ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 7622 MBytes (7991873536 bytes)
  (008) Multiprocessors, (128) CUDA Cores/MP:    1024 CUDA Cores
  GPU Max Clock rate:                            624 MHz (0.62 GHz)
  Memory Clock rate:                             624 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1
Result = PASS

This aligns closely with running deviceQuery natively on the Jetson Orin Nano (the only difference is the CUDA runtime version, 12.5 instead of 12.2):

tyler@orin-nano-01:~/Downloads/cuda-samples-12.2/Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.2 / 12.5
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 7622 MBytes (7991873536 bytes)
  (008) Multiprocessors, (128) CUDA Cores/MP:    1024 CUDA Cores
  GPU Max Clock rate:                            624 MHz (0.62 GHz)
  Memory Clock rate:                             624 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.5, NumDevs = 1
Result = PASS

So it seems we’re back to square one?

I went back to the kernel build, and it turns out that disabling the real-time configuration (and just building the generic kernel) fixes this issue. Now the pods run as expected!

tyler@orin-nano-01:~$ kubectl get pods -A
NAMESPACE     NAME                                     READY   STATUS      RESTARTS   AGE
kube-system   coredns-6799fbcd5-c9nhn                  1/1     Running     0          41s
kube-system   helm-install-traefik-8wjzw               0/1     Completed   1          41s
kube-system   helm-install-traefik-crd-9cmjv           0/1     Completed   0          41s
kube-system   local-path-provisioner-6f5d79df6-kkp6c   1/1     Running     0          41s
kube-system   metrics-server-54fd9b65b-rkt78           1/1     Running     0          41s
kube-system   svclb-traefik-cd82205c-d4qw6             2/2     Running     0          25s
kube-system   traefik-7d5f6474df-5whb6                 1/1     Running     0          25s
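Since the fix was moving off the real-time kernel, a quick way to confirm which flavour is currently booted is to look for PREEMPT_RT in the kernel version string, as seen in the uname -a output earlier in the thread. This is a minimal sketch; the is_rt_kernel helper name is hypothetical.

```shell
#!/bin/sh
# is_rt_kernel VERSION_STRING
# Succeeds if the given kernel version string (e.g. from `uname -v`)
# advertises a PREEMPT_RT build.
is_rt_kernel() {
    case "$1" in
        *PREEMPT_RT*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_rt_kernel "$(uname -v)"; then
    echo "real-time (PREEMPT_RT) kernel is running"
else
    echo "non-RT kernel is running"
fi
```

Running this before and after rebuilding the kernel makes it easy to be sure the board actually booted the image that was just flashed.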

Thanks for the update!