Cannot pass the GPU through to a Kubernetes pod on the Jetson AGX Orin dev kit

Hello everyone,

Previously, I was able to pass the GPU through using K3s and MicroK8s on my NVIDIA Jetson AGX Orin dev kit, but I can no longer do so. After upgrading to a newer version of JetPack, it stopped working. I updated the NVIDIA Container Toolkit, but it did not help.
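In case it matters, updating the toolkit was essentially the standard reconfiguration, roughly like this (a sketch rather than my exact command history, so adjust for your setup):

#reinstall the toolkit and re-point containerd at the NVIDIA runtime
sudo apt install --reinstall nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd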

My JetPack version:

Package: nvidia-jetpack
Source: nvidia-jetpack (6.2)
Version: 6.2+b77
Architecture: arm64
Maintainer: NVIDIA Corporation

Running containers with GPU support works as expected in Docker, but not with containerd within k3s or microk8s.
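For example, a quick check on the Docker side still passes (the image tag here is just an example):

#works under Docker with the nvidia runtime
sudo docker run --rm --runtime nvidia docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    python3 -c "import torch; print(torch.cuda.is_available())"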

Thanks in advance for your help.

Hi,

Could you try the commands in the below link:

Thanks.

Hello, thanks for your reply. I tried it, but it’s not working. Actually, it was working before I upgraded JetPack to the new version. I was running it via MicroK8s without any NVIDIA device plugin. I’m not sure what changed.

Hi,

Could you share which JetPack version you used before?
Was it JetPack 5 or JetPack 6.0?

When you upgraded JetPack, did you use the apt command or reflash with SDK Manager?
If you used the apt command directly, could you try reflashing the system to see if that solves the issue?

Thanks.

Hello @AastaLLL, it was JetPack 6.1. Should I reflash it again? I’d rather not, to be honest. I’m not sure what to do next. Maybe I’ll try going back to a previous version.

I wonder if the Ubuntu snap of microk8s is having issues. Their docs don’t seem to have caught up (or I’m looking in the wrong place) with NVIDIA now shipping nvidia-container-toolkit (check with sudo apt show nvidia-container-toolkit) rather than nvidia-container-runtime.
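A quick way to see which of the two packages is actually installed (package names as they appear on JetPack 6, adjust if yours differ):

dpkg -l | grep -E 'nvidia-container-toolkit|nvidia-container-runtime|libnvidia-container'
which nvidia-container-runtime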

Also, here is the content of /snap/microk8s/current/addons/core/addons/gpu/enable:
#!/usr/bin/env bash
DIR=$(realpath $(dirname $0))
echo "
WARNING: The gpu addon has been renamed to nvidia.
Please use 'microk8s enable nvidia' instead."
$DIR/../nvidia/enable "${@}"

The GitHub file looks significantly different: microk8s-core-addons/addons/nvidia/enable at 8e163efe9b8e4739e96685f3459af02c94fc54bb · canonical/microk8s-core-addons · GitHub

On a fresh sudo snap install microk8s --classic
microk8s returns this:
microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
2025/04/02 14:36:34.682552 cmd_run.go:1276: WARNING: cannot create user data directory: failed to verify SELinux context of /home/scott/snap: exec: "matchpathcon": executable file not found in $PATH
No resources found in gpu-operator-resources namespace.

I will sudo snap remove --purge --terminate microk8s
and then sudo snap install microk8s --classic again to see if it will work now.

And it doesn’t
microk8s enable nvidia
2025/04/02 14:51:49.179911 cmd_run.go:1276: WARNING: cannot create user data directory: failed to verify SELinux context of /home/scott/snap: exec: "matchpathcon": executable file not found in $PATH
Addon nvidia was not found in any repository

So, for fun, I think I’ll try building it to get a better look at the source code:
microk8s/docs/build.md at master · canonical/microk8s · GitHub


Hi @AastaLLL, see below. @whitesscott, thanks a lot for testing.

sudo ctr run --rm -t --net-host --runtime=nvidia \
    docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    pytorch-container bash
ctr: failed to start shim: failed to resolve runtime path: invalid runtime name nvidia, correct runtime name should be either format like `io.containerd.runc.v1` or a full path to the binary: unknown

Then,

sudo ctr run --rm -t --net-host --gpus 0 \
    docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    pytorch-container bash
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: NvRmMemInitNvmap failed with Permission denied
356: Memory Manager Not supported



****NvRmMemMgrInit failed**** error type: 196626


libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196626
NvRmMemInitNvmap failed with Permission denied
356: Memory Manager Not supported



****NvRmMemMgrInit failed**** error type: 196626


libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196626
NvRmMemInitNvmap failed with Permission denied
356: Memory Manager Not supported



****NvRmMemMgrInit failed**** error type: 196626


libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196626
nvidia-container-cli: detection error: nvml error: unknown error: unknown
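For the first error, ctr apparently wants a runtime type rather than the short name nvidia. This is what I plan to try next (an untested sketch; the runtime binary path is an assumption about where the apt package installs it):

#untested: keep the runc v2 shim but swap in nvidia-container-runtime,
#and run privileged so the Tegra device nodes (/dev/nvmap etc.) are reachable
sudo ctr run --rm -t --net-host --privileged \
    --runtime io.containerd.runc.v2 \
    --runc-binary /usr/bin/nvidia-container-runtime \
    docker.io/dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
    pytorch-container bash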

I don’t know if the following is relevant to your use case, but this works for me outside of Kubernetes.

docker run -it --net=host --runtime nvidia --privileged --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/scott/.git/jetson-containers/packages:/workspace nvcr.io/nvidia/pytorch:25.01-py3 bash

import torch
print(torch.__version__)
2.6.0a0+ecf3bae40a.nv25.01
print(torch.cuda.is_available())
True

nvidia-smi

Sun Apr 6 04:45:50 2025
NVIDIA-SMI 540.4.0    Driver Version: 540.4.0    CUDA Version: 12.6
0  Orin (nvgpu)
No running processes found

I gave up on microk8s. The nvidia addon disappeared and I could not get it back.

================================
I just tried the following and the pods are running in k3s.

NVIDIA GPU Operator for Kubernetes.

#install k3s
curl -sfL https://get.k3s.io | sh

#so you can just run kubectl instead of "k3s kubectl"
sudo cat /etc/rancher/k3s/k3s.yaml >> ~/.kube/config

#to allow sudo-less kubectl execution (change scott to your user id)
sudo setfacl -m u:scott:r /etc/rancher/k3s/k3s.yaml

#prereqs for installing gpu-operator
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.0 \
  --set driver.nvidiaDriverCRD.enabled=true

kubectl get -n gpu-operator po
NAME READY STATUS RESTARTS AGE
gpu-operator-1743918301-node-feature-discovery-gc-7bb9bff5sp9fk 1/1 Running 0 48m
gpu-operator-1743918301-node-feature-discovery-master-69bbrlw7p 1/1 Running 0 48m
gpu-operator-1743918301-node-feature-discovery-worker-2lbbb 1/1 Running 0 48m
gpu-operator-6cb986b754-4tfkc 1/1 Running 0 48m
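To double-check that the GPU is actually advertised to the scheduler, and not just that the operator pods are Running, something like this should show an nvidia.com/gpu entry on the node (plain kubectl, nothing operator-specific):

kubectl describe node | grep -i 'nvidia.com/gpu'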

There’s a nice /usr/local/bin/k3s-uninstall.sh that does a good job of removing k3s if needed.

Hi @whitesscott, thanks for the detailed report. I’ve tried it, but the issue persists. Have you checked GPU availability inside the pod? As far as I remember, the NVIDIA GPU Operator does not work on Tegra devices. IMHO, it’s containerd that is broken.

In my case, it always shows:

>>> torch.cuda.is_available()
False
>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 491, in get_device_name
    return get_device_properties(device).name
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
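One thing I still want to verify is whether k3s wired the nvidia runtime into the containerd config it generates (paths below are the k3s defaults, so treat them as assumptions):

#does the generated containerd config contain an nvidia runtime entry,
#and is nvidia-container-runtime visible on the PATH?
grep -A3 -i nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
which nvidia-container-runtime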

I think I’m using a differently built PyTorch.

#just updated to most recent image.
ngc registry image pull nvcr.io/nvidia/pytorch:25.03-py3

docker run -it --net=host --runtime nvidia --privileged --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/scott/.git/jetson-containers/packages:/workspace nvcr.io/nvidia/pytorch:25.03-py3 bash

Following are from within the container:

root@chiorin:/workspace# uname -a
Linux chiorin 5.15.148-tegra #1 SMP PREEMPT Tue Jan 7 17:14:38 PST 2025 aarch64 aarch64 aarch64 GNU/Linux
root@chiorin:/workspace# cat /etc/nv_tegra_release
#R36 (release), REVISION: 4.3, GCID: 38968081, BOARD: generic, EABI: aarch64, DATE: Wed Jan 8 01:49:37 UTC 2025
#KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

root@chiorin:/workspace#

python
Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]

import torch
torch.cuda.is_available()
True
torch.cuda.get_device_name(0)
'Orin'

root@chiorin:/workspace# deviceQuery

deviceQuery Starting…

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Orin"
CUDA Driver Version: 12.6
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 268435456 MBytes (281474976710655 bytes)
(16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS


Hi,

Do you mean the command works with JetPack 6.1 but fails after upgrading to JetPack 6.2?

NvRmMemInitNvmap failed with Permission denied

This looks like a permission issue.
Could you try the settings in the link below?

Thanks.


@AastaLLL Docker containers work fine with GPU sharing when using the nvidia runtime option. I’m talking about pods within Kubernetes: I was able to run them before with k3s and microk8s, but currently I cannot.
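My current suspicion is that with containerd-based k3s the nvidia runtime gets registered but is not used unless the pod requests it through a RuntimeClass. Roughly what I mean (a sketch; I have not confirmed the handler name on my node):

cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
#then, in the pod spec:
#  spec:
#    runtimeClassName: nvidia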

I just got this to work on an AGX Orin dev kit (32 GB).
I don’t think I skipped any steps.

Attached the following to retain yaml formatting:
aGpuEnableKubernetesVariant.txt (13.1 KB)


cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
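To verify Docker actually picked up nvidia as the default runtime before handing things over to k3s:

sudo docker info | grep -i runtime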

#install k3s
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker" sh -

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nfd.yaml

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/gpu-feature-discovery-daemonset.yaml
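Once the device plugin daemonset is up, the node should advertise the GPU as an allocatable resource; a quick sanity check (the escaped-dot column syntax is the usual kubectl custom-columns form):

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"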


#if you want a LoadBalancer, change the address pool to your subnet.

sudo k3s kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml

cat <<EOF | sudo k3s kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.100-192.168.1.110
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
EOF


sudo k3s kubectl create namespace gpu-test
#or whatever namespace you want to run GPU containers in.

#create a pod spec with your image and other info.
cat gpu-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
  namespace: gpu-test
spec:
  hostNetwork: true
  hostIPC: true
  restartPolicy: Never
  containers:
  - name: gpu-test-pod
    image: nvcr.io/nvidia/pytorch:25.03-py3
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    securityContext:
      privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    volumeMounts:
    - name: workspace-volume
      mountPath: /workspace
  volumes:
  - name: workspace-volume
    hostPath:
      path: /home/scott/.git/k3s
      type: Directory

sudo k3s kubectl apply -f gpu-test-pod.yaml


sudo k3s kubectl exec -it -n gpu-test gpu-test-pod -- bash

#then inside the k3s/kubernetes container


root@chiorin:/workspace# date
Wed Apr 9 02:13:58 UTC 2025

root@chiorin:/workspace# python
Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import torch
torch.cuda.is_available()
True
torch.cuda.get_device_name(0)
'Orin'
exit()

root@chiorin:/workspace# deviceQuery
deviceQuery Starting…

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Orin"
CUDA Driver Version: 12.6
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 268435456 MBytes (281474976710655 bytes)
(16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS


sudo k3s kubectl get all -A

NAMESPACE NAME READY STATUS RESTARTS AGE
default pod/gpu-feature-discovery-pwkv4 0/1 Completed 0 48m
gpu-test pod/gpu-test-pod 1/1 Running 0 31m
kube-system pod/coredns-ff8999cc5-6vc64 1/1 Running 2 (38m ago) 7h52m
kube-system pod/helm-install-traefik-crd-27wss 0/1 Completed 0 7h52m
kube-system pod/helm-install-traefik-gzj78 0/1 Completed 2 7h52m
kube-system pod/local-path-provisioner-774c6665dc-kjwwx 1/1 Running 2 (38m ago) 7h52m
kube-system pod/metrics-server-6f4c6675d5-99bqm 0/1 Running 2 (38m ago) 7h52m
kube-system pod/nvidia-device-plugin-daemonset-hzpmw 1/1 Running 2 (38m ago) 4h27m
kube-system pod/svclb-traefik-834d3e94-bs4cw 0/2 CrashLoopBackOff 162 (66s ago) 5h47m
kube-system pod/traefik-67bfb46dcb-qxbvm 1/1 Running 2 (38m ago) 7h51m
metallb-system pod/controller-c76b688-2nmhx 1/1 Running 2 (38m ago) 5h53m
metallb-system pod/speaker-bth5k 1/1 Running 2 (38m ago) 5h53m

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.43.0.1 443/TCP 7h52m
kube-system service/kube-dns ClusterIP 10.43.0.10 53/UDP,53/TCP,9153/TCP 7h52m
kube-system service/metrics-server ClusterIP 10.43.41.157 443/TCP 7h52m
kube-system service/traefik LoadBalancer 10.43.161.162 192.168.1.5 80:31178/TCP,443:32226/TCP 7h51m
metallb-system service/webhook-service ClusterIP 10.43.158.187 443/TCP 5h53m

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/nvidia-device-plugin-daemonset 1 1 1 1 1 4h27m
kube-system daemonset.apps/svclb-traefik-834d3e94 1 1 0 1 0 7h51m
metallb-system daemonset.apps/speaker 1 1 1 1 1 kubernetes.io/os=linux 5h53m

NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 1/1 1 1 7h52m
kube-system deployment.apps/local-path-provisioner 1/1 1 1 7h52m
kube-system deployment.apps/metrics-server 0/1 1 0 7h52m
kube-system deployment.apps/traefik 1/1 1 1 7h51m
metallb-system deployment.apps/controller 1/1 1 1 5h53m

NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-ff8999cc5 1 1 1 7h52m
kube-system replicaset.apps/local-path-provisioner-774c6665dc 1 1 1 7h52m
kube-system replicaset.apps/metrics-server-6f4c6675d5 1 1 0 7h52m
kube-system replicaset.apps/traefik-67bfb46dcb 1 1 1 7h51m
metallb-system replicaset.apps/controller-c76b688 1 1 1 5h53m

NAMESPACE NAME STATUS COMPLETIONS DURATION AGE
default job.batch/gpu-feature-discovery Complete 1/1 4s 48m
kube-system job.batch/helm-install-traefik Complete 1/1 35s 7h52m
kube-system job.batch/helm-install-traefik-crd Complete 1/1 20s 7h52m