Greetings to all,
I have been trying to install MicroK8s on the NVIDIA Jetson AGX Orin Developer Kit and have made some progress, but I cannot see the Tegra GPU from inside a Pod. Has anyone already tried this? Calling torch.cuda.get_device_name() from a Python shell inside the Pod fails with the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 423, in get_device_name
    return get_device_properties(device).name
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 453, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 302, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
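To narrow down where this error comes from, a small probe run inside the Pod can show whether the device nodes and the CUDA driver library are visible to the container at all. This is only a rough sketch: the device node paths listed are assumptions (the exact set differs between JetPack/L4T releases), and `probe_gpu_visibility` is a hypothetical helper name, not part of PyTorch or the device plugin.

```python
import ctypes
import os

# Assumed Tegra device nodes; adjust to match what exists on the host.
CANDIDATE_DEVICE_NODES = ["/dev/nvmap", "/dev/nvhost-ctrl", "/dev/nvhost-gpu"]


def probe_gpu_visibility(device_nodes=CANDIDATE_DEVICE_NODES):
    """Report which GPU-related resources this container can see."""
    report = {"device_nodes": {p: os.path.exists(p) for p in device_nodes}}

    # The CUDA driver library is normally mounted in from the host on Jetson.
    try:
        ctypes.CDLL("libcuda.so.1")
        report["libcuda_loadable"] = True
    except OSError:
        report["libcuda_loadable"] = False

    # None means torch is missing or failed to import in this environment.
    try:
        import torch
        report["torch_cuda_available"] = bool(torch.cuda.is_available())
    except Exception:
        report["torch_cuda_available"] = None

    return report


if __name__ == "__main__":
    for key, value in probe_gpu_visibility().items():
        print(f"{key}: {value}")
```

If `libcuda_loadable` comes back False, the driver libraries are probably not being mounted into the container, which would point at the container runtime configuration rather than at PyTorch.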
I installed the NVIDIA device plugin using the following command:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
Here are the logs of the device plugin. It detected the Tegra platform on the Jetson Orin and registered nvidia.com/gpu with the kubelet:
2024/08/22 03:49:43 Starting FS watcher.
2024/08/22 03:49:43 Starting OS watcher.
2024/08/22 03:49:43 Starting Plugins.
2024/08/22 03:49:43 Loading configuration.
2024/08/22 03:49:43 Updating config with default resource matching patterns.
2024/08/22 03:49:43
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2024/08/22 03:49:43 Retreiving plugins.
2024/08/22 03:49:43 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2024/08/22 03:49:43 Detected Tegra platform: /sys/devices/soc0/family has 'tegra' prefix
2024/08/22 03:49:43 Starting GRPC server for 'nvidia.com/gpu'
2024/08/22 03:49:43 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/08/22 03:49:43 Registered device plugin for 'nvidia.com/gpu' with Kubelet
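Since the plugin reports that it registered with the kubelet, the node should also advertise the resource under status.allocatable. A minimal sketch for checking that from the Node object's JSON is below; `allocatable_gpus` is a hypothetical helper name, and it assumes the quantity is a plain integer string, as it normally is for extended resources.

```python
def allocatable_gpus(node: dict, resource: str = "nvidia.com/gpu") -> int:
    """Return how many units of `resource` a Node object advertises.

    `node` is the parsed JSON of a single Kubernetes Node. A missing
    entry is treated as zero.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, "0"))


# Example with a trimmed-down, made-up Node object:
sample_node = {"status": {"allocatable": {"cpu": "12", "nvidia.com/gpu": "1"}}}
print(allocatable_gpus(sample_node))
```

The input can be produced with `kubectl get node <node-name> -o json` and parsed with `json.load`. A result of 0 would mean the scheduler does not see the GPU at all, whereas 1 would suggest the problem is in how the container is started rather than in the plugin.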
Here is my YAML file:
apiVersion: v1
kind: Pod
metadata:
  name: torch
spec:
  imagePullSecrets:
    - name: my-image-pull-secret
  containers:
    - name: torchtest
      image: dustynv/l4t-pytorch:r36.2.0
      securityContext:
        privileged: true
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
Thanks in advance for your support!