Cannot pass the GPU through to a Kubernetes pod on the Jetson AGX Orin dev kit

Hello everyone,

Previously, I was able to pass the GPU through using K3s and MicroK8s on my NVIDIA Jetson AGX Orin dev kit, but I can no longer do so. After upgrading to a newer version of JetPack, it stopped working. I updated the NVIDIA Container Toolkit, but it did not help.
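In case it matters, updating the toolkit was essentially the standard reconfiguration, roughly like this (a sketch rather than my exact command history, so adjust for your setup):

#reinstall the toolkit and re-point containerd at the NVIDIA runtime
sudo apt install --reinstall nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd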

My JetPack version:

Package: nvidia-jetpack
Source: nvidia-jetpack (6.2)
Version: 6.2+b77
Architecture: arm64
Maintainer: NVIDIA Corporation

Running containers with GPU support works as expected in Docker, but not with containerd within k3s or microk8s.
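For example, a quick check on the Docker side still passes (the image tag here is just an example):

#works under Docker with the nvidia runtime
sudo docker run --rm --runtime nvidia docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    python3 -c "import torch; print(torch.cuda.is_available())"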

Thanks in advance for your help.

Hi,

Could you try the commands in the below link:

Thanks.

Hello, thanks for your reply. I tried it, but it’s not working. Actually, it was working before I upgraded JetPack to the new version. I was running it via MicroK8s without any NVIDIA device plugin. I’m not sure what changed.

Hi,

Could you share which JetPack version you used before?
Was it JetPack 5 or JetPack 6.0?

When you upgraded JetPack, did you use the apt command or reflash with SDK Manager?
If you used the apt command directly, could you try reflashing the system to see if that solves the issue?

Thanks.

Hello @AastaLLL, it was JetPack 6.1. Should I reflash it again? I’d rather not, to be honest. I’m not sure what to do next. Maybe I’ll try going back to a previous version.

I wonder if the Ubuntu snap of microk8s is having issues. Their docs don’t seem to have caught up (or I’m looking in the wrong place) with NVIDIA now shipping nvidia-container-toolkit (check with sudo apt show nvidia-container-toolkit) rather than nvidia-container-runtime.
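A quick way to see which of the two packages is actually installed (package names as they appear on JetPack 6, adjust if yours differ):

dpkg -l | grep -E 'nvidia-container-toolkit|nvidia-container-runtime|libnvidia-container'
which nvidia-container-runtime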

Also, here is the content of /snap/microk8s/current/addons/core/addons/gpu/enable:
#!/usr/bin/env bash
DIR=$(realpath $(dirname $0))
echo "
WARNING: The gpu addon has been renamed to nvidia.
Please use 'microk8s enable nvidia' instead."
$DIR/../nvidia/enable "${@}"

The GitHub file looks significantly different: microk8s-core-addons/addons/nvidia/enable at 8e163efe9b8e4739e96685f3459af02c94fc54bb · canonical/microk8s-core-addons · GitHub

On a fresh sudo snap install microk8s --classic
microk8s returns this:
microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
2025/04/02 14:36:34.682552 cmd_run.go:1276: WARNING: cannot create user data directory: failed to verify SELinux context of /home/scott/snap: exec: "matchpathcon": executable file not found in $PATH
No resources found in gpu-operator-resources namespace.

I will sudo snap remove --purge --terminate microk8s
and then sudo snap install microk8s --classic again to see if it will work now.

And it doesn’t
microk8s enable nvidia
2025/04/02 14:51:49.179911 cmd_run.go:1276: WARNING: cannot create user data directory: failed to verify SELinux context of /home/scott/snap: exec: "matchpathcon": executable file not found in $PATH
Addon nvidia was not found in any repository

So, for fun, I think I’ll try building it to get a better look at the source code:
microk8s/docs/build.md at master · canonical/microk8s · GitHub


Hi @AastaLLL, see below. @whitesscott, thanks a lot for testing.

sudo ctr run --rm -t --net-host --runtime=nvidia \
    docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    pytorch-container bash
ctr: failed to start shim: failed to resolve runtime path: invalid runtime name nvidia, correct runtime name should be either format like `io.containerd.runc.v1` or a full path to the binary: unknown

Then,

sudo ctr run --rm -t --net-host --gpus 0 \
    docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    pytorch-container bash
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: NvRmMemInitNvmap failed with Permission denied
356: Memory Manager Not supported



****NvRmMemMgrInit failed**** error type: 196626


libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196626
NvRmMemInitNvmap failed with Permission denied
356: Memory Manager Not supported



****NvRmMemMgrInit failed**** error type: 196626


libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196626
NvRmMemInitNvmap failed with Permission denied
356: Memory Manager Not supported



****NvRmMemMgrInit failed**** error type: 196626


libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196626
nvidia-container-cli: detection error: nvml error: unknown error: unknown
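For the first error, ctr apparently wants a runtime type rather than the short name nvidia. This is what I plan to try next (an untested sketch; the runtime binary path is an assumption about where the apt package installs it):

#untested: keep the runc v2 shim but swap in nvidia-container-runtime,
#and run privileged so the Tegra device nodes (/dev/nvmap etc.) are reachable
sudo ctr run --rm -t --net-host --privileged \
    --runtime io.containerd.runc.v2 \
    --runc-binary /usr/bin/nvidia-container-runtime \
    docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    pytorch-container bash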

I don’t know if the following is relevant to your use case, but this works for me outside of Kubernetes.

docker run -it --net=host --runtime nvidia --privileged --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/scott/.git/jetson-containers/packages:/workspace nvcr.io/nvidia/pytorch:25.01-py3 bash

import torch
print(torch.__version__)
2.6.0a0+ecf3bae40a.nv25.01
print(torch.cuda.is_available())
True

nvidia-smi

Sun Apr 6 04:45:50 2025
NVIDIA-SMI 540.4.0    Driver Version: 540.4.0    CUDA Version: 12.6
0  Orin (nvgpu)
No running processes found

I gave up on microk8s. The nvidia addon disappeared and I could not get it back.

================================
I just tried the following and the pods are running in k3s.

NVIDIA GPU Operator for Kubernetes.

#install k3s
curl -sfL https://get.k3s.io | sh

#so you can just run kubectl instead of "k3s kubectl"
sudo cat /etc/rancher/k3s/k3s.yaml >> ~/.kube/config

#to allow sudo-less kubectl execution (change scott to your user id)
sudo setfacl -m u:scott:r /etc/rancher/k3s/k3s.yaml

#prereqs for installing gpu-operator
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.0 \
  --set driver.nvidiaDriverCRD.enabled=true

kubectl get -n gpu-operator po
NAME READY STATUS RESTARTS AGE
gpu-operator-1743918301-node-feature-discovery-gc-7bb9bff5sp9fk 1/1 Running 0 48m
gpu-operator-1743918301-node-feature-discovery-master-69bbrlw7p 1/1 Running 0 48m
gpu-operator-1743918301-node-feature-discovery-worker-2lbbb 1/1 Running 0 48m
gpu-operator-6cb986b754-4tfkc 1/1 Running 0 48m
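To double-check that the GPU is actually advertised to the scheduler, and not just that the operator pods are Running, something like this should show an nvidia.com/gpu entry on the node (plain kubectl, nothing operator-specific):

kubectl describe node | grep -i 'nvidia.com/gpu'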

There’s a nice /usr/local/bin/k3s-uninstall.sh that does a good job of removing k3s if needed.

Hi @whitesscott, thanks for the detailed report. I’ve tried it, but the issue persists. Have you checked GPU availability inside the pod? As far as I remember, the NVIDIA GPU Operator does not work on Tegra devices. IMHO, it’s containerd that is broken.

In my case, it always shows:

>>> torch.cuda.is_available()
False
>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 491, in get_device_name
    return get_device_properties(device).name
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
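One thing I still want to verify is whether k3s wired the nvidia runtime into the containerd config it generates (paths below are the k3s defaults, so treat them as assumptions):

#does the generated containerd config contain an nvidia runtime entry,
#and is nvidia-container-runtime visible on the PATH?
grep -A3 -i nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
which nvidia-container-runtime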

I think I’m using a differently built PyTorch.

#just updated to most recent image.
ngc registry image pull nvcr.io/nvidia/pytorch:25.03-py3

docker run -it --net=host --runtime nvidia --privileged --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/scott/.git/jetson-containers/packages:/workspace nvcr.io/nvidia/pytorch:25.03-py3 bash

Following are from within the container:

root@chiorin:/workspace# uname -a
Linux chiorin 5.15.148-tegra #1 SMP PREEMPT Tue Jan 7 17:14:38 PST 2025 aarch64 aarch64 aarch64 GNU/Linux
root@chiorin:/workspace# cat /etc/nv_tegra_release
#R36 (release), REVISION: 4.3, GCID: 38968081, BOARD: generic, EABI: aarch64, DATE: Wed Jan 8 01:49:37 UTC 2025
#KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

root@chiorin:/workspace#

python
Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]

import torch
torch.cuda.is_available()
True
torch.cuda.get_device_name(0)
'Orin'

root@chiorin:/workspace# deviceQuery

deviceQuery Starting…

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Orin"
CUDA Driver Version: 12.6
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 268435456 MBytes (281474976710655 bytes)
(16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS


Hi,

Do you mean the command works with JetPack 6.1 but fails after upgrading to JetPack 6.2?

NvRmMemInitNvmap failed with Permission denied

This looks like a permission issue.
Could you try the settings in the link below?

Thanks.


@AastaLLL Docker containers work fine with GPU sharing when using the nvidia runtime option. I’m talking about pods within Kubernetes: I was able to run them before with k3s and microk8s, but currently I cannot.
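My current suspicion is that with containerd-based k3s the nvidia runtime gets registered but is not used unless the pod requests it through a RuntimeClass. Roughly what I mean (a sketch; I have not confirmed the handler name on my node):

cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
#then, in the pod spec:
#  spec:
#    runtimeClassName: nvidia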

I just got this to work on an AGX Orin dev kit (32 GB).
I don’t think I skipped any steps.

Attached the following to retain yaml formatting:
aGpuEnableKubernetesVariant.txt (13.1 KB)


cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
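To verify Docker actually picked up nvidia as the default runtime before handing things over to k3s:

sudo docker info | grep -i runtime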

#install k3s
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker" sh -

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nfd.yaml

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/gpu-feature-discovery-daemonset.yaml
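Once the device plugin daemonset is up, the node should advertise the GPU as an allocatable resource; a quick sanity check (the escaped-dot column syntax is the usual kubectl custom-columns form):

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"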


#if you want a LoadBalancer, change the address pool to your subnet.

sudo k3s kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml

cat <<EOF | sudo k3s kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.100-192.168.1.110
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
EOF


sudo k3s kubectl create namespace gpu-test
#or whatever namespace you want to run GPU containers in.

#create a pod spec with your image and other info.
cat gpu-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
  namespace: gpu-test
spec:
  hostNetwork: true
  hostIPC: true
  restartPolicy: Never
  containers:
  - name: gpu-test-pod
    image: nvcr.io/nvidia/pytorch:25.03-py3
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    securityContext:
      privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    volumeMounts:
    - name: workspace-volume
      mountPath: /workspace
  volumes:
  - name: workspace-volume
    hostPath:
      path: /home/scott/.git/k3s
      type: Directory

sudo k3s kubectl apply -f gpu-test-pod.yaml


sudo k3s kubectl exec -it -n gpu-test gpu-test-pod -- bash

#then inside the k3s/kubernetes container


root@chiorin:/workspace# date
Wed Apr 9 02:13:58 UTC 2025

root@chiorin:/workspace# python
Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import torch
torch.cuda.is_available()
True
torch.cuda.get_device_name(0)
'Orin'
exit()

root@chiorin:/workspace# deviceQuery
deviceQuery Starting…

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Orin"
CUDA Driver Version: 12.6
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 268435456 MBytes (281474976710655 bytes)
(16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS


sudo k3s kubectl get all -A

NAMESPACE NAME READY STATUS RESTARTS AGE
default pod/gpu-feature-discovery-pwkv4 0/1 Completed 0 48m
gpu-test pod/gpu-test-pod 1/1 Running 0 31m
kube-system pod/coredns-ff8999cc5-6vc64 1/1 Running 2 (38m ago) 7h52m
kube-system pod/helm-install-traefik-crd-27wss 0/1 Completed 0 7h52m
kube-system pod/helm-install-traefik-gzj78 0/1 Completed 2 7h52m
kube-system pod/local-path-provisioner-774c6665dc-kjwwx 1/1 Running 2 (38m ago) 7h52m
kube-system pod/metrics-server-6f4c6675d5-99bqm 0/1 Running 2 (38m ago) 7h52m
kube-system pod/nvidia-device-plugin-daemonset-hzpmw 1/1 Running 2 (38m ago) 4h27m
kube-system pod/svclb-traefik-834d3e94-bs4cw 0/2 CrashLoopBackOff 162 (66s ago) 5h47m
kube-system pod/traefik-67bfb46dcb-qxbvm 1/1 Running 2 (38m ago) 7h51m
metallb-system pod/controller-c76b688-2nmhx 1/1 Running 2 (38m ago) 5h53m
metallb-system pod/speaker-bth5k 1/1 Running 2 (38m ago) 5h53m

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.43.0.1 443/TCP 7h52m
kube-system service/kube-dns ClusterIP 10.43.0.10 53/UDP,53/TCP,9153/TCP 7h52m
kube-system service/metrics-server ClusterIP 10.43.41.157 443/TCP 7h52m
kube-system service/traefik LoadBalancer 10.43.161.162 192.168.1.5 80:31178/TCP,443:32226/TCP 7h51m
metallb-system service/webhook-service ClusterIP 10.43.158.187 443/TCP 5h53m

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/nvidia-device-plugin-daemonset 1 1 1 1 1 4h27m
kube-system daemonset.apps/svclb-traefik-834d3e94 1 1 0 1 0 7h51m
metallb-system daemonset.apps/speaker 1 1 1 1 1 kubernetes.io/os=linux 5h53m

NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 1/1 1 1 7h52m
kube-system deployment.apps/local-path-provisioner 1/1 1 1 7h52m
kube-system deployment.apps/metrics-server 0/1 1 0 7h52m
kube-system deployment.apps/traefik 1/1 1 1 7h51m
metallb-system deployment.apps/controller 1/1 1 1 5h53m

NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-ff8999cc5 1 1 1 7h52m
kube-system replicaset.apps/local-path-provisioner-774c6665dc 1 1 1 7h52m
kube-system replicaset.apps/metrics-server-6f4c6675d5 1 1 0 7h52m
kube-system replicaset.apps/traefik-67bfb46dcb 1 1 1 7h51m
metallb-system replicaset.apps/controller-c76b688 1 1 1 5h53m

NAMESPACE NAME STATUS COMPLETIONS DURATION AGE
default job.batch/gpu-feature-discovery Complete 1/1 4s 48m
kube-system job.batch/helm-install-traefik Complete 1/1 35s 7h52m
kube-system job.batch/helm-install-traefik-crd Complete 1/1 20s 7h52m