I just got this to work on an AGX Orin dev kit (32 GB).
Don't think I skipped any steps.
Attached the following to retain yaml formatting:
aGpuEnableKubernetesVariant.txt (13.1 KB)
cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
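If you want a quick sanity check of that file before restarting Docker, something like the Python sketch below might help. The function name `check_docker_daemon_config` is my own invention, not part of Docker or the NVIDIA toolkit, and it only looks at the keys shown in the config above.

```python
import json


def check_docker_daemon_config(text: str) -> list[str]:
    """Return a list of problems found in daemon.json text (empty list means OK)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top level is not a JSON object"]

    problems = []
    # The NVIDIA runtime has to be the default so k3s-launched containers get it.
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not set to "nvidia"')
    runtimes = config.get("runtimes")
    nvidia = runtimes.get("nvidia") if isinstance(runtimes, dict) else None
    if not isinstance(nvidia, dict):
        problems.append('no "nvidia" entry under "runtimes"')
    elif "path" not in nvidia:
        problems.append('the "nvidia" runtime has no "path"')
    return problems
```

Reading /etc/docker/daemon.json and passing its text to the function should give an empty list when the file matches the config above.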
# Install k3s
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker" sh -
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nfd.yaml
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/gpu-feature-discovery-daemonset.yaml
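To confirm the device plugin actually registered the GPU, one option is to look at the node's allocatable resources. A rough Python sketch, assuming you feed it the text printed by `sudo k3s kubectl get nodes -o json` (the helper name `gpu_allocatable` is made up for illustration):

```python
import json


def gpu_allocatable(nodes_json: str) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count.

    Expects the JSON text from `kubectl get nodes -o json`.
    """
    nodes = json.loads(nodes_json)
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        # The API reports quantities as strings; a missing key means no GPU advertised.
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result
```

A count of 0 for the Orin node would suggest the device plugin pod is not running or Docker is not using the nvidia runtime.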
# If you want a load balancer, change the address pool to match your subnet.
sudo k3s kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml
cat <<EOF | sudo k3s kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.100-192.168.1.110
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
EOF
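Before applying the pool it can be worth checking that the range really sits inside your LAN subnet. A small sketch using Python's standard `ipaddress` module; `range_in_subnet` is a hypothetical helper, and `192.168.1.0/24` is only an assumed subnet for the example range above:

```python
import ipaddress


def range_in_subnet(pool: str, subnet: str) -> bool:
    """True if a 'first-last' address range lies entirely inside the subnet."""
    first_text, last_text = pool.split("-")
    first = ipaddress.ip_address(first_text.strip())
    last = ipaddress.ip_address(last_text.strip())
    network = ipaddress.ip_network(subnet, strict=False)
    return first <= last and first in network and last in network


# Assumed subnet; replace with your own network.
print(range_in_subnet("192.168.1.100-192.168.1.110", "192.168.1.0/24"))
```

It is also worth making sure the range does not overlap your router's DHCP range, which this sketch does not check.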
sudo k3s kubectl create namespace gpu-test
# Or use whatever namespace you want to run GPU containers in.
# Create a pod spec with your image and other details.
cat gpu-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
  namespace: gpu-test
spec:
  hostNetwork: true
  hostIPC: true
  restartPolicy: Never
  containers:
  - name: gpu-test-pod
    image: nvcr.io/nvidia/pytorch:25.03-py3
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    securityContext:
      privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    volumeMounts:
    - name: workspace-volume
      mountPath: /workspace
  volumes:
  - name: workspace-volume
    hostPath:
      path: /home/scott/.git/k3s
      type: Directory
sudo k3s kubectl apply -f gpu-test-pod.yaml
sudo k3s kubectl exec -it -n gpu-test gpu-test-pod -- bash
# Then, inside the k3s/Kubernetes container:
root@chiorin:/workspace# date
Wed Apr 9 02:13:58 UTC 2025
root@chiorin:/workspace# python
Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'Orin'
>>> exit()
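For a slightly stronger check than `is_available()`, you could run a small computation on the GPU and compare it with the CPU result. This is only a sketch: `gpu_smoke_test` is a name I made up, the matrix size is arbitrary, and the tolerance is deliberately loose in case TF32 matmul is enabled in the container.

```python
def gpu_smoke_test(size: int = 256) -> dict:
    """Multiply two random matrices on the GPU and compare with the CPU result."""
    try:
        import torch
    except ImportError:
        # Not inside a PyTorch image; nothing more to check.
        return {"torch": False, "cuda": False}

    report = {"torch": True, "cuda": torch.cuda.is_available()}
    if not report["cuda"]:
        return report

    report["device"] = torch.cuda.get_device_name(0)
    a = torch.rand(size, size)
    b = torch.rand(size, size)
    gpu_result = (a.cuda() @ b.cuda()).cpu()
    # Loose tolerance: GPU matmul may use reduced-precision accumulation.
    report["matches_cpu"] = bool(torch.allclose(gpu_result, a @ b, rtol=1e-2, atol=1e-2))
    return report


print(gpu_smoke_test())
```

Inside the pod above, the report should include the device name that `get_device_name(0)` returned.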
root@chiorin:/workspace# deviceQuery
deviceQuery Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "Orin"
CUDA Driver Version: 12.6
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 268435456 MBytes (281474976710655 bytes)
(16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS
sudo k3s kubectl get all -A
NAMESPACE        NAME                                          READY   STATUS             RESTARTS        AGE
default          pod/gpu-feature-discovery-pwkv4               0/1     Completed          0               48m
gpu-test         pod/gpu-test-pod                              1/1     Running            0               31m
kube-system      pod/coredns-ff8999cc5-6vc64                   1/1     Running            2 (38m ago)     7h52m
kube-system      pod/helm-install-traefik-crd-27wss            0/1     Completed          0               7h52m
kube-system      pod/helm-install-traefik-gzj78                0/1     Completed          2               7h52m
kube-system      pod/local-path-provisioner-774c6665dc-kjwwx   1/1     Running            2 (38m ago)     7h52m
kube-system      pod/metrics-server-6f4c6675d5-99bqm           0/1     Running            2 (38m ago)     7h52m
kube-system      pod/nvidia-device-plugin-daemonset-hzpmw      1/1     Running            2 (38m ago)     4h27m
kube-system      pod/svclb-traefik-834d3e94-bs4cw              0/2     CrashLoopBackOff   162 (66s ago)   5h47m
kube-system      pod/traefik-67bfb46dcb-qxbvm                  1/1     Running            2 (38m ago)     7h51m
metallb-system   pod/controller-c76b688-2nmhx                  1/1     Running            2 (38m ago)     5h53m
metallb-system   pod/speaker-bth5k                             1/1     Running            2 (38m ago)     5h53m
NAMESPACE        NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
default          service/kubernetes        ClusterIP      10.43.0.1       <none>        443/TCP                      7h52m
kube-system      service/kube-dns          ClusterIP      10.43.0.10      <none>        53/UDP,53/TCP,9153/TCP       7h52m
kube-system      service/metrics-server    ClusterIP      10.43.41.157    <none>        443/TCP                      7h52m
kube-system      service/traefik           LoadBalancer   10.43.161.162   192.168.1.5   80:31178/TCP,443:32226/TCP   7h51m
metallb-system   service/webhook-service   ClusterIP      10.43.158.187   <none>        443/TCP                      5h53m
NAMESPACE        NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system      daemonset.apps/nvidia-device-plugin-daemonset   1         1         1       1            1           <none>                   4h27m
kube-system      daemonset.apps/svclb-traefik-834d3e94           1         1         0       1            0           <none>                   7h51m
metallb-system   daemonset.apps/speaker                          1         1         1       1            1           kubernetes.io/os=linux   5h53m
NAMESPACE        NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
kube-system      deployment.apps/coredns                  1/1     1            1           7h52m
kube-system      deployment.apps/local-path-provisioner   1/1     1            1           7h52m
kube-system      deployment.apps/metrics-server           0/1     1            0           7h52m
kube-system      deployment.apps/traefik                  1/1     1            1           7h51m
metallb-system   deployment.apps/controller               1/1     1            1           5h53m
NAMESPACE        NAME                                                DESIRED   CURRENT   READY   AGE
kube-system      replicaset.apps/coredns-ff8999cc5                   1         1         1       7h52m
kube-system      replicaset.apps/local-path-provisioner-774c6665dc   1         1         1       7h52m
kube-system      replicaset.apps/metrics-server-6f4c6675d5           1         1         0       7h52m
kube-system      replicaset.apps/traefik-67bfb46dcb                  1         1         1       7h51m
metallb-system   replicaset.apps/controller-c76b688                  1         1         1       5h53m
NAMESPACE     NAME                                 STATUS     COMPLETIONS   DURATION   AGE
default       job.batch/gpu-feature-discovery      Complete   1/1           4s         48m
kube-system   job.batch/helm-install-traefik       Complete   1/1           35s        7h52m
kube-system   job.batch/helm-install-traefik-crd   Complete   1/1           20s        7h52m