Can't find GPU in Kubernetes on Jetson Nano cluster

Hi, guys

We built a customized Jetson SoM cluster. However, while building the Kubernetes demo, we ran into a problem: we can find the GPU on both master and worker nodes when using Docker directly, but in K8s pods only the GPU on the master node can be found. Do you know how to fix this?

Any information will be appreciated.

Hi,

Would you mind sharing more information about your custom board?

How do you link the Jetson Nano boards?
And how do you decide which one is the primary GPU?

Thanks

The devices are connected through an on-board 5-port switch. The primary GPU is the GPU on the master node.
I’m following this link. Below is the yml I’m using.

jet@jetson:~$ cat gpu-test.yml 
apiVersion: v1
kind: Pod
metadata:
  name: devicequery
spec:
  containers:
    - name: nvidia
      image: jitteam/devicequery:latest
      command: [ "./deviceQuery" ]
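
One thing worth noting about this spec: it does not request a GPU, so the scheduler is free to place the pod on any node. If the NVIDIA device plugin is running on the workers, the GPU can be requested explicitly. A minimal sketch (same pod, with an added resource limit; the pod name here is hypothetical):

```yaml
# Sketch: the same deviceQuery pod, but with an explicit GPU request so the
# scheduler only places it on a node that advertises nvidia.com/gpu.
# Assumes the NVIDIA device plugin DaemonSet is running on the worker nodes.
apiVersion: v1
kind: Pod
metadata:
  name: devicequery-gpu
spec:
  containers:
    - name: nvidia
      image: jitteam/devicequery:latest
      command: [ "./deviceQuery" ]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If a worker does not advertise `nvidia.com/gpu`, a pod like this stays Pending instead of silently landing on a node without a visible GPU, which makes the problem easier to localize.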

Hi,

Thanks for your information.

Just want to clarify first.
The issue is that the secondary GPU cannot be found with kubectl run but works well with docker run.
Is our understanding correct?

Thanks.

Thanks for the reply. Yes, your understanding is correct.

Thanks.

Let us discuss this internally and reply to you later.

Hi, do you have any idea what caused this problem? Any information will be appreciated.

Hi,

We just discussed this internally.

Do you know how we could reproduce or simulate your case in our environment?
This will help us find out the root cause.

Thanks.

I think connecting 4 Jetson Nanos to a router and building the K8s cluster on them will reproduce the problem. All the hardware works fine separately. We think the problem comes from NVIDIA's plugin for Kubernetes.

hi memoryleakyu:
K3s has not been tested by NVIDIA for creating a Jetson cluster. Please follow the guide using K8s directly:

1) Create the master node
Environment: Ubuntu 18.04 (16.04 is not supported)
Install Docker & K8s

  • Installing Docker-CE
  • Installing Kubernetes
    During this step, when you execute the command sudo kubeadm init --pod-network-cidr=192.168.0.0/16, copy the kubeadm join line from the output; you will need it later to add each Jetson node to the cluster:
 $ kubeadm join 192.168.0.150:6443 --token lvyap6.9fqi7j7zvfqkmjmo --discovery-token-ca-cert-hash sha256:73cea3b17042e88de24d33e8bba7ee5d90b49b71cb4aec3dfacf74b4fd5d52ac
After these two steps, the master node should be created.

2) Create the Jetson nodes

Install each Jetson (Nano/TX2/Xavier/Xavier NX) with the same steps, as follows:

  • Just use the Jetson's native Docker first; I think it should be OK (I tested installing Docker with the same steps as on the master).

  • Install K8s with the same steps as on the master, but note that you should stop before the kubeadm init command; that command and everything after it only need to run on the master, because those are the commands that create the cluster.

  • Set the user on each Jetson's Docker:
    sudo groupadd docker
    sudo usermod -aG docker $USER
    newgrp docker

  • Add each Jetson node to the cluster (master node), using the command you copied from the master's kubeadm init output above:

 $ kubeadm join 192.168.0.150:6443 --token lvyap6.9fqi7j7zvfqkmjmo --discovery-token-ca-cert-hash sha256:73cea3b17042e88de24d33e8bba7ee5d90b49b71cb4aec3dfacf74b4fd5d52ac

  • Check all nodes on the master:
 $ kubectl get nodes

You should see output listing all the Jetson nodes that have been added to the cluster.
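
Besides checking that the nodes are Ready, it is worth checking that each node actually advertises a GPU to Kubernetes; a node whose device plugin is not running simply has no nvidia.com/gpu resource. A sketch of the check, grepping a sample excerpt in place of live cluster output (the sample text is illustrative, not from a real node):

```shell
# On a real cluster you would run, for each node:
#   kubectl describe node <node-name> | grep 'nvidia.com/gpu'
# A node with a working device plugin lists nvidia.com/gpu under both
# Capacity and Allocatable. Here we grep a sample excerpt to show the
# expected shape:
sample="Capacity:
  cpu:             4
  nvidia.com/gpu:  1
Allocatable:
  cpu:             4
  nvidia.com/gpu:  1"
echo "$sample" | grep -c 'nvidia.com/gpu'
```

A count of 2 (Capacity and Allocatable) on every node means each Jetson is exposing its GPU; a count of 0 on the workers would point at the device plugin rather than at deviceQuery.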


Hi Jeffli, thanks for the detailed reply.

But can you read the GPU from each node? Currently I can successfully add all 4 Jetson Nanos to my cluster, but I can only read 1 CUDA device, from the master node. Can you read a CUDA device on each node?

hi memoryleakyu:
Since in my cluster the master node is an x86 VM, I will create a two-Jetson-node cluster (master/worker) to reproduce this issue and check the GPU info.

Thanks so much for your patience! I’ll wait for your update.

hi memoryleakyu:
I created a two-node cluster: master (Xavier) and worker (NX), and I just installed the plugin in the cluster. Now, describing the node, the NX info is as below:

Is this the same as what you see? How do you test the worker node's GPU info? I can repeat your operation to check.

Hi Jeffli

I tested it in the following way:

jet@jetson:~$ sudo kubectl get node
NAME           STATUS   ROLES    AGE    VERSION
jetson         Ready    master   4d5h   v1.18.8+k3s1
jetson-qqq     Ready    worker   4d5h   v1.18.8+k3s1
peterjetson1   Ready    worker   4d5h   v1.18.8+k3s1
qqq-jetson     Ready    worker   4d5h   v1.18.8+k3s1
jet@jetson:~$ sudo kubectl logs devicequery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3956 MBytes (4148391936 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host

And I only got 1 CUDA device.
Can you share which plugin you used? I'll check whether your plugin works.

hi memoryleakyu:
Try this link:
https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html#kubernetes-beforebegin

When I execute the command on the master node, kubectl run -i -t nvidia --image=jitteam/devicequery, to deploy this image, we can see it successfully runs on the NX (the name is xavier, but from the 384 CUDA cores, this is running on the NX).
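
With kubectl run, the scheduler decides where the pod lands, so a single run only proves one node's GPU. To check a specific worker's GPU, the pod can be pinned to that node. A sketch (the node name jetson-qqq is taken from the kubectl get node output above; the pod name is hypothetical, and the NVIDIA device plugin is assumed to be deployed):

```yaml
# Sketch: run deviceQuery on one specific worker by pinning the pod to it,
# then repeat with each worker's node name to check every GPU in turn.
apiVersion: v1
kind: Pod
metadata:
  name: devicequery-worker
spec:
  nodeName: jetson-qqq   # bypasses the scheduler and pins the pod here
  containers:
    - name: nvidia
      image: jitteam/devicequery:latest
      command: [ "./deviceQuery" ]
```

Running this against each worker in turn, and reading the logs with kubectl logs devicequery-worker, shows whether deviceQuery can see the GPU on every node rather than only wherever the scheduler happens to place it.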

Hi @Jeffli
I work with @memoryleakyu .
Thanks for sharing. I tried to follow this document; however, its content is outdated.
https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html#kubernetes-beforebegin
It cannot be done at this stage. Take a closer look inside:


There is no such file in the system. I also tried to look online, but I couldn't find the content.

kubectl apply -f /etc/kubeadm/device-plugin/nvidia-1.9.10.yml

I also found that Kubernetes support for the ARM64 architecture is still under review and has not been merged.

Please also help to confirm whether the current software can do some verification work. We used Jetson modules to build a high-performance Jetson cluster.

hi baozhu.zuo:
What do you mean by "but I couldn't find the content"?
This is the plugin yml:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml

@Jeffli
I mean we can't find this file in the system:
/etc/kubeadm/device-plugin/nvidia-1.9.10.yml

I've also tried the v0.6.0 branch, but it still doesn't work.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml

Is there a big difference between the v0.6.0 branch and the 1.0.0-beta tag? Does the unmerged PR matter?