We built a customized Jetson SoM cluster. However, while building the Kubernetes demo, we ran into a problem. We can see the GPU on both the master and worker nodes when using Docker directly, but in K8s pods, only the GPU on the master node can be found. Do you know how to fix it?
The devices are connected through an on-board 5-port switch. The primary GPU is the GPU on the master node.
I’m following this link. Below is the YAML I’m using.
Just want to clarify first.
The issue is that the secondary GPU cannot be found with kubectl run, but works fine with docker run.
Is our understanding correct?
I think connecting 4 Jetson Nanos to a router and building the K8s cluster on them will reproduce the problem. All the hardware works fine separately. We think the problem comes from NVIDIA’s device plugin for Kubernetes.
Installing Kubernetes
During this step, when you execute the command `sudo kubeadm init --pod-network-cidr=192.168.0.0/16`, you should copy output like the following to add the Jetson nodes to the cluster:
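The join command printed at the end of `kubeadm init` typically looks like the sketch below; the IP address, token, and hash here are placeholders, and you must copy the real values from your own init output:

```shell
# Run on each Jetson worker node. The IP, token, and hash below are
# placeholders -- substitute the values printed by your `kubeadm init`.
sudo kubeadm join 192.168.1.10:6443 \
    --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:<hash-from-init-output>
```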
But can you read the GPU from each node? Currently I can successfully add all 4 Jetson Nanos to my cluster, but I can only read 1 CUDA device, from the master node. Can you read a CUDA device on each node?
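One quick way to check whether the device plugin registered a GPU on every node is to look at each node's allocatable resources. This is a sketch, assuming the plugin advertises the standard `nvidia.com/gpu` resource name; if a worker shows no entry, the plugin is not running or not registering on that node:

```shell
# Show Capacity/Allocatable sections for every node; a healthy node
# should list an nvidia.com/gpu entry with a non-zero count.
kubectl describe nodes | grep -A 8 "Capacity"

# A more targeted view (the backslash escapes the dots in the key):
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:".status.allocatable.nvidia\.com/gpu"
```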
Hi memoryleakyu:
Since in my cluster the master node is an x86 VM, I will create a 2-node Jetson cluster (master/worker) to reproduce this issue and check the GPU info.
Hi memoryleakyu:
I created a two-node cluster: master (Xavier), worker (NX), and just installed the plugin in the cluster. Now, describing the nodes, the info is as below:
jet@jetson:~$ sudo kubectl get node
NAME           STATUS   ROLES    AGE    VERSION
jetson         Ready    master   4d5h   v1.18.8+k3s1
jetson-qqq     Ready    worker   4d5h   v1.18.8+k3s1
peterjetson1   Ready    worker   4d5h   v1.18.8+k3s1
qqq-jetson     Ready    worker   4d5h   v1.18.8+k3s1
jet@jetson:~$ sudo kubectl logs devicequery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.2 / 10.0
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3956 MBytes (4148391936 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host
And I only got 1 CUDA device.
Can you share which plugin you used? I’ll check whether your plugin works.
When I execute the command on the master node: `kubectl run -i -t nvidia --image=jitteam/devicequery` to deploy this image, we can see it successfully runs on the NX (the pod reports the name "xavier"; from the CUDA core count of 384, this ran on the NX).
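To verify each worker's GPU individually rather than letting the scheduler pick a node, you can pin the deviceQuery pod to one node at a time. This is a sketch using `nodeName`; `jetson-qqq` is one of the worker names from the `kubectl get node` output earlier, and you would substitute each node name in turn:

```yaml
# devicequery-pinned.yaml -- run deviceQuery on one specific worker
apiVersion: v1
kind: Pod
metadata:
  name: devicequery-pinned
spec:
  nodeName: jetson-qqq   # substitute each worker node name in turn
  restartPolicy: Never
  containers:
    - name: devicequery
      image: jitteam/devicequery
```

Apply it and read the result with `kubectl apply -f devicequery-pinned.yaml && kubectl logs devicequery-pinned`; a worker where the device plugin is broken should show 0 CUDA devices or fail to start the container.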
I also found that Kubernetes support for the ARM64 architecture is still under review and not merged.
Please also help confirm whether the current software can do some verification work. We used Jetson modules to build a high-performance Jetson cluster.