Kubernetes Jetson Cluster and TensorFlow not recognizing all GPU memory

Hi,

I am building a cluster based on Jetsons, currently three Nanos (one 4GB + two 2GB), but soon I will add a few TX1/TX2.

I have set up the Kubernetes cluster, as you can see here (the 4GB Nano is the master):

dora@dora-desktop:~$ kubectl get nodes
NAME            STATUS   ROLES                  AGE   VERSION
dora-desktop    Ready    control-plane,master   31h   v1.23.1
dora1-desktop   Ready    worker                 30h   v1.23.1
dora2-desktop   Ready    worker                 30h   v1.23.1

I have pulled the Docker container l4t-tensorflow:r32.6.1-tf2.5-py3.
I run the container with docker run and then try to get the number of GPUs / memory with Python and TensorFlow inside it.
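The run command is something like this (a sketch of my invocation; the --runtime nvidia flag is what exposes the Jetson GPU to the container):

sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3

Inside that container I get: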

root@dora-desktop:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:01:44.841147: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.222173: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:01:50.236451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.236618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:01:50.236772: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.288994: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:01:50.289184: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:01:50.333929: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:01:50.374851: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:01:50.485243: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:01:50.521378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:01:50.523064: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:01:50.523720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

So I get only one GPU with a deviceMemorySize of 3.87GiB, but it should report something like 3 GPUs and around 7.6GiB of memory in total, shouldn't it?

If I instead create a pod with kubectl:

sudo vim l4t-tensorflow.yaml
apiVersion: v1
kind: Pod
metadata:
  name: l4t-tensorflow
spec:
  containers:
    - name: nvidia
      image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
      command: [ "sleep" ]
      args: [ "1d" ]

kubectl apply -f l4t-tensorflow.yaml
kubectl exec -it l4t-tensorflow -- /bin/bash

I get:

root@l4t-tensorflow:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:33:27.803933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:36.898995: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:33:36.964329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:36.964530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 1.93GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:33:36.964612: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:37.118609: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:33:37.119229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:33:37.208444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:33:37.317166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:33:37.455693: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:33:37.543625: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:33:37.547713: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:33:37.548301: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.548901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.549240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

So with the pod I get the resources of a worker and with docker I get the resources of the master.

If I instead create a deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: l4t-tensorflow
spec:
  selector:
    matchLabels:
      app: cluster
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: cluster
    spec:
      containers:
        - name: nvidia
          image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3

I don’t know how to open a terminal for the deployment…
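I guess I could list the deployment's pods by the app: cluster label and exec into one of them, something like this (untested sketch; <pod-name> would be one of the names returned by the first command):

kubectl get pods -l app=cluster
kubectl exec -it <pod-name> -- /bin/bash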

How can I set it to use all the resources of the cluster?

Thank you,
Best regards

Hi,

It seems that Kubernetes cannot find the other two 2GB Nanos.

We have a discussion about enabling the GPUs for the Nano cluster.
Would you mind giving it a check to see if it helps?
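For context, enabling the GPUs usually means running the NVIDIA device plugin on each node so that the GPU is advertised as a schedulable resource. A pod would then request it roughly like this (a sketch, assuming the plugin exposes the resource as nvidia.com/gpu):

apiVersion: v1
kind: Pod
metadata:
  name: l4t-tensorflow-gpu
spec:
  containers:
    - name: nvidia
      image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
      command: [ "sleep" ]
      args: [ "1d" ]
      resources:
        limits:
          nvidia.com/gpu: 1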

Thanks.

Hi,

I checked that, but without success.

Thank you

Hi,
As suggested in that post, with kubectl run -i -t nvidia --image=jitteam/devicequery, only the 4GB Nano is recognised.

But if I try kubectl run nvidia --image=jitteam/devicequery --replicas=2, I get that --replicas is not supported:

dora@dora-desktop:~$ kubectl run nvidia --image=jitteam/devicequery --replicas=2
error: unknown flag: --replicas
See 'kubectl run --help' for usage.

Thank you

Hi,

Which version do you use?
Please note that the --replicas flag was added in version 1.9.

Thanks.

Hi,
It worked with -- replicas (with the space).
Should I be able to query the resources of the whole cluster from the master node, or is that something that the Kubernetes layer manages?
I mean, with deviceQuery on the master, should I get 384 CUDA cores and 8GB of RAM, or only 128 cores and 4GB (or 2GB)?
Thank you

Hi,

You should get 3 nodes and each node has its own resources.
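You can check what each node advertises with standard kubectl, for example:

kubectl describe nodes | grep -A 8 Allocatable

Each pod is scheduled onto a single node and only sees that node's GPU and memory; the cluster does not aggregate them into one large device.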
Thanks.

Hi,

So it is not working for me: I only get one node, with the resources of a single node.

Thank you

Thanks for your patience.

We are checking this internally.
Will share more information with you later.

Thanks.