Installation of single-node MicroK8s on the NVIDIA Jetson AGX Orin Developer Kit

Greetings to all,

I have been trying to install MicroK8s on the NVIDIA Jetson AGX Orin Developer Kit and have made some progress, but I have not been able to see the Tegra GPU inside a Pod. I was wondering whether someone has already tried this.

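For reference, the failure below comes from calling torch.cuda.get_device_name() in a Python shell inside the Pod (as shown in the stack trace):

>>> import torch
>>> torch.cuda.get_device_name()
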
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 423, in get_device_name
    return get_device_properties(device).name
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 453, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 302, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

I installed the NVIDIA device plugin using the following command:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

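One way to confirm that the plugin actually registered the resource is to check whether the node advertises nvidia.com/gpu in its capacity and allocatable resources (assuming kubectl here is an alias for microk8s kubectl):

kubectl describe node | grep nvidia.com/gpu
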
Here are the logs of the device plugin; it was able to detect the Jetson Orin's GPU:

2024/08/22 03:49:43 Starting FS watcher.
2024/08/22 03:49:43 Starting OS watcher.
2024/08/22 03:49:43 Starting Plugins.
2024/08/22 03:49:43 Loading configuration.
2024/08/22 03:49:43 Updating config with default resource matching patterns.
2024/08/22 03:49:43 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2024/08/22 03:49:43 Retreiving plugins.
2024/08/22 03:49:43 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2024/08/22 03:49:43 Detected Tegra platform: /sys/devices/soc0/family has 'tegra' prefix
2024/08/22 03:49:43 Starting GRPC server for 'nvidia.com/gpu'
2024/08/22 03:49:43 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/08/22 03:49:43 Registered device plugin for 'nvidia.com/gpu' with Kubelet

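(For reference, the plugin logs above can be pulled with something like the following; the DaemonSet name assumes the default created by the manifest linked above:)

kubectl logs -n kube-system daemonset/nvidia-device-plugin-daemonset
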
Here is my YAML file:

apiVersion: v1
kind: Pod
metadata:
  name: torch
spec:
  imagePullSecrets:
  - name: my-image-pull-secret
  containers:
  - name: torchtest
    image: dustynv/l4t-pytorch:r36.2.0
    securityContext:
      privileged: true
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

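For completeness, a minimal way to reproduce the failure with this spec (assuming it is saved as torch.yaml) is:

kubectl apply -f torch.yaml
kubectl exec -it torch -- python3 -c "import torch; print(torch.cuda.get_device_name())"
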
Thanks in advance for your support!

Hi @shahizat! Are you able to run CUDA through PyTorch in this container outside of Kubernetes (I suppose torch.cuda.get_device_name(), like in the call stack), or does it only fail under k8s?

This is the C++ source of the torch._C._cuda_init() function: https://github.com/pytorch/pytorch/blob/de06345e9b7221dfa8d2ca90e1a40e03aa32004f/torch/csrc/cuda/Module.cpp#L1379

This is what prints that error, in response to cudaErrorInsufficientDriver:
https://github.com/pytorch/pytorch/blob/de06345e9b7221dfa8d2ca90e1a40e03aa32004f/c10/cuda/CUDAFunctions.cpp#L39

If you ran deviceQuery in this container, would it report the same? If so, perhaps something during the Kubernetes installation interfered with the drivers, or installed dGPU drivers from apt, something like that. Fortunately it's just a container, so you can wipe it and try again :)

Hi @dusty_nv, thanks for your reply. 😊 I was able to run your Docker container and get torch.cuda.get_device_name() working without k8s, with the following results:

>>> import torch
>>> torch.cuda.get_device_name()
'Orin'

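(For context, outside of Kubernetes the container was presumably started with Docker's NVIDIA runtime, e.g. something like the command below; on Jetson it is that runtime which mounts the Tegra driver libraries into the container:)

docker run --runtime nvidia -it --rm dustynv/l4t-pytorch:r36.2.0
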
This is the same result I expected to see in the MicroK8s environment. I'm using the same container (dustynv/l4t-pytorch:r36.2.0) in my MicroK8s Pod YAML, as you can see above. By the way, I was unable to find the location of deviceQuery inside the container. I plan to set up the cluster with k3s instead and verify GPU passthrough there. I also suspect that my installation of the NVIDIA device plugin might be incorrect, since I was surprised that the GPU was identified at all. Maybe I am wrong.

Best regards,
Shakhizat

Hey Shakhizat, sorry for the delay - normally I don't include the CUDA Toolkit samples in every CUDA container. You should just need to do this:

git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/deviceQuery
make
./deviceQuery
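(In more recent checkouts of the cuda-samples repo the sample has moved into a numbered subdirectory, so the path may instead be:)

cd cuda-samples/Samples/1_Utilities/deviceQuery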

or like this: jetson-containers/packages/cuda/cuda/Dockerfile.samples (dusty-nv/jetson-containers on GitHub)

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.