Can't find GPU in Kubernetes on Jetson Nano cluster

Hi, guys

We built a customized Jetson SoM cluster. However, while building the Kubernetes demo, we ran into a problem: we can find the GPU on both master and worker nodes when using Docker directly, but in K8s pods only the GPU on the master node can be found. Do you know how to fix this?

Any information will be appreciated.

Hi,

Would you mind sharing more information about your custom board?

How do you link the Jetson Nano boards?
And how do you decide which one is the primary GPU?

Thanks

The devices are connected through an on-board 5-port switch. The primary GPU is the GPU on the master node.
I’m following this link. Below is the yml I’m using.

jet@jetson:~$ cat gpu-test.yml 
apiVersion: v1
kind: Pod
metadata:
  name: devicequery
spec:
  containers:
    - name: nvidia
      image: jitteam/devicequery:latest
      command: [ "./deviceQuery" ]
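
One thing worth noting about this spec: it does not request a GPU, so the scheduler is free to place the pod on any node. If the NVIDIA device plugin is running on the workers, the GPU can be requested explicitly. A minimal sketch (same pod, with an added resource limit; the pod name here is hypothetical):

```yaml
# Sketch: the same deviceQuery pod, but with an explicit GPU request so the
# scheduler only places it on a node that advertises nvidia.com/gpu.
# Assumes the NVIDIA device plugin DaemonSet is running on the worker nodes.
apiVersion: v1
kind: Pod
metadata:
  name: devicequery-gpu
spec:
  containers:
    - name: nvidia
      image: jitteam/devicequery:latest
      command: [ "./deviceQuery" ]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If a worker does not advertise `nvidia.com/gpu`, a pod like this stays Pending instead of silently landing on a node without a visible GPU, which makes the problem easier to localize.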

Hi,

Thanks for your information.

Just want to clarify first.
The issue is that the secondary GPU cannot be found with kubectl run but works well with docker run.
Is our understanding correct?

Thanks.

Thanks for the reply. Yes, your understanding is correct.

Thanks.

Let us discuss this internally and reply to you later.

Hi, do you have any idea what caused this problem? Any information will be appreciated.

Hi,

We just discussed this internally.

Do you know how we could reproduce or simulate your case in our environment?
This will help us find out the root cause.

Thanks.

I think connecting 4 Jetson Nanos to a router and building the K8s cluster on them will reproduce the problem. All the hardware works fine separately. We think the problem comes from NVIDIA's plugin for Kubernetes.

hi memoryleakyu:
K3s has not been tested by NVIDIA for creating a Jetson cluster. Please follow the guide using K8s directly:

1) Create the master node
Environment: Ubuntu 18.04 (16.04 is not supported)
Install Docker & K8s

  • Installing Docker-CE
  • Installing Kubernetes
    During this step, when you execute the command sudo kubeadm init --pod-network-cidr=192.168.0.0/16, copy the kubeadm join line from the output; you will need it later to add each Jetson node to the cluster:
 $ kubeadm join 192.168.0.150:6443 --token lvyap6.9fqi7j7zvfqkmjmo --discovery-token-ca-cert-hash sha256:73cea3b17042e88de24d33e8bba7ee5d90b49b71cb4aec3dfacf74b4fd5d52ac
After these two steps, the master node should be created.

2) Create the Jetson nodes

Install each Jetson (Nano/TX2/Xavier/Xavier NX) with the same steps, as follows:

  • Just use the Jetson's native Docker first; I think it should be OK (I tested installing Docker with the same steps as on the master).

  • Install K8s with the same steps as on the master, but note that you should stop before the kubeadm init command; that command and everything after it only need to run on the master, because those are the commands that create the cluster.

  • Set the user on each Jetson's Docker:
    sudo groupadd docker
    sudo usermod -aG docker $USER
    newgrp docker

  • Add each Jetson node to the cluster (master node), using the command you copied from the master's kubeadm init output above:

 $ kubeadm join 192.168.0.150:6443 --token lvyap6.9fqi7j7zvfqkmjmo --discovery-token-ca-cert-hash sha256:73cea3b17042e88de24d33e8bba7ee5d90b49b71cb4aec3dfacf74b4fd5d52ac

  • Check all nodes on the master:
 $ kubectl get nodes

You should see output listing all the Jetson nodes that have been added to the cluster.
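
Besides checking that the nodes are Ready, it is worth checking that each node actually advertises a GPU to Kubernetes; a node whose device plugin is not running simply has no nvidia.com/gpu resource. A sketch of the check, grepping a sample excerpt in place of live cluster output (the sample text is illustrative, not from a real node):

```shell
# On a real cluster you would run, for each node:
#   kubectl describe node <node-name> | grep 'nvidia.com/gpu'
# A node with a working device plugin lists nvidia.com/gpu under both
# Capacity and Allocatable. Here we grep a sample excerpt to show the
# expected shape:
sample="Capacity:
  cpu:             4
  nvidia.com/gpu:  1
Allocatable:
  cpu:             4
  nvidia.com/gpu:  1"
echo "$sample" | grep -c 'nvidia.com/gpu'
```

A count of 2 (Capacity and Allocatable) on every node means each Jetson is exposing its GPU; a count of 0 on the workers would point at the device plugin rather than at deviceQuery.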


Hi Jeffli, thanks for the detailed reply.

But can you read the GPU from each node? Currently I can successfully add all 4 Jetson Nanos to my cluster, but I can only read 1 CUDA device, from the master node. Can you read a CUDA device on each node?

hi memoryleakyu:
Since in my cluster the master node is an x86 VM, I will create a two-Jetson-node cluster (master/worker) to reproduce this issue and check the GPU info.

Thanks so much for your patience! I’ll wait for your update.

hi memoryleakyu:
I created a two-node cluster: master (Xavier) and worker (NX), and I just installed the plugin in the cluster. Now, describing the node, the NX info is as below:

Is this the same as what you see? How do you test the worker node's GPU info? I can repeat your operation to check.

Hi Jeffli

I tested it in the following way:

jet@jetson:~$ sudo kubectl get node
NAME           STATUS   ROLES    AGE    VERSION
jetson         Ready    master   4d5h   v1.18.8+k3s1
jetson-qqq     Ready    worker   4d5h   v1.18.8+k3s1
peterjetson1   Ready    worker   4d5h   v1.18.8+k3s1
qqq-jetson     Ready    worker   4d5h   v1.18.8+k3s1
jet@jetson:~$ sudo kubectl logs devicequery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3956 MBytes (4148391936 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host

And I only got 1 CUDA device.
Can you share which plugin you used? I'll check whether your plugin works.

hi memoryleakyu:
Try this link:
https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html#kubernetes-beforebegin

When I execute the command on the master node, kubectl run -i -t nvidia --image=jitteam/devicequery, to deploy this image, we can see it successfully runs on the NX (the name is xavier, but from the 384 CUDA cores, this is running on the NX).
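
With kubectl run, the scheduler decides where the pod lands, so a single run only proves one node's GPU. To check a specific worker's GPU, the pod can be pinned to that node. A sketch (the node name jetson-qqq is taken from the kubectl get node output above; the pod name is hypothetical, and the NVIDIA device plugin is assumed to be deployed):

```yaml
# Sketch: run deviceQuery on one specific worker by pinning the pod to it,
# then repeat with each worker's node name to check every GPU in turn.
apiVersion: v1
kind: Pod
metadata:
  name: devicequery-worker
spec:
  nodeName: jetson-qqq   # bypasses the scheduler and pins the pod here
  containers:
    - name: nvidia
      image: jitteam/devicequery:latest
      command: [ "./deviceQuery" ]
```

Running this against each worker in turn, and reading the logs with kubectl logs devicequery-worker, shows whether deviceQuery can see the GPU on every node rather than only wherever the scheduler happens to place it.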

Hi @Jeffli
I work with @memoryleakyu .
Thanks for sharing. I tried to follow this document; however, its content is outdated.
https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html#kubernetes-beforebegin
It cannot be done at this stage. Take a closer look inside:


There is no such file in the system. I also tried to look online, but I couldn't find the content.

kubectl apply -f /etc/kubeadm/device-plugin/nvidia-1.9.10.yml

I also found that Kubernetes support for the ARM64 architecture is still under review and has not been merged.

Please also help to confirm whether the current software can do some verification work. We used Jetson modules to build a high-performance Jetson cluster.

hi baozhu.zuo:
What do you mean by "but I couldn't find the content"?
This is the plugin yml:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml

@Jeffli
I mean we can't find this file in the system:
/etc/kubeadm/device-plugin/nvidia-1.9.10.yml

I've also tried the v0.6.0 branch, but it still doesn't work.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml

Is there a big difference between the v0.6.0 branch and the 1.0.0-beta tag? Does the unmerged PR matter?