Kubernetes Jetson Cluster and Tensorflow not recognizing all GPU memory

Hi,

I am building a cluster based on Jetsons: for now three Nanos (one 4 GB and two 2 GB), but soon I will add a few TX1/TX2.

I have set up the Kubernetes cluster, as you can see here (the 4 GB Nano is the master):

dora@dora-desktop:~$ kubectl get nodes
NAME            STATUS   ROLES                  AGE   VERSION
dora-desktop    Ready    control-plane,master   31h   v1.23.1
dora1-desktop   Ready    worker                 30h   v1.23.1
dora2-desktop   Ready    worker                 30h   v1.23.1

I have pulled the docker container l4t-tensorflow:r32.6.1-tf2.5-py3.
I run the container with docker run, and when I try to get the number of GPUs / the memory with Python and TensorFlow, I get:

root@dora-desktop:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:01:44.841147: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.222173: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:01:50.236451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.236618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:01:50.236772: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.288994: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:01:50.289184: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:01:50.333929: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:01:50.374851: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:01:50.485243: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:01:50.521378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:01:50.523064: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:01:50.523720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

So I get only one GPU and a deviceMemorySize of 3.87 GiB, but it should report something like 3 GPUs and ~7.6 GB of memory, shouldn't it?

If I create a pod with kubectl:

sudo vim l4t-tensorflow.yaml
apiVersion: v1
kind: Pod
metadata:
  name: l4t-tensorflow
spec:
  containers:
    - name: nvidia
      image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
      command: [ "sleep" ]
      args: [ "1d" ]

kubectl apply -f l4t-tensorflow.yaml
kubectl exec -it l4t-tensorflow -- /bin/bash

I get:

root@l4t-tensorflow:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:33:27.803933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:36.898995: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:33:36.964329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:36.964530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 1.93GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:33:36.964612: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:37.118609: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:33:37.119229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:33:37.208444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:33:37.317166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:33:37.455693: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:33:37.543625: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:33:37.547713: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:33:37.548301: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.548901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.549240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

So with the pod I get the resources of a worker and with docker I get the resources of the master.

If I create a deployment with kubectl:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: l4t-tensorflow
spec:
  selector:
    matchLabels:
      app: cluster
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: cluster
    spec:
      containers:
      - name: nvidia
        image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3

I don’t know how to open a terminal for the deployment…
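For reference, a deployment does not get its own terminal; you exec into one of the pods it creates. A minimal sketch, assuming the pods carry the label app=cluster from the deployment manifest (the pod name is a placeholder you substitute from the list):

```shell
# List the pods created by the deployment (label taken from the manifest)
kubectl get pods -l app=cluster -o wide

# Open a shell in one of them; replace <pod-name> with a name printed above
kubectl exec -it <pod-name> -- /bin/bash
```

The -o wide output also shows which node each replica landed on, which is useful for checking that the scheduler actually spread the pods across the workers.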

How can I set it to use all the resources of the cluster?

Thank you,
Best regards

Hi,

It seems that Kubernetes cannot find the other two 2GB Nanos.

We have a discussion about enabling the GPUs for the Nano cluster.
Would you mind giving it a check to see if it helps?

Thanks.

Hi,

I checked that, but without success.

Thank you

Hi,
As suggested in that post, with kubectl run -i -t nvidia --image=jitteam/devicequery, I get:


Only the 4 GB Nano is recognised, but if I try kubectl run nvidia --image=jitteam/devicequery --replicas=2,
I get that --replicas is not supported:

dora@dora-desktop:~$ kubectl run nvidia --image=jitteam/devicequery --replicas=2
error: unknown flag: --replicas
See 'kubectl run --help' for usage.
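For what it's worth, newer kubectl versions removed --replicas from kubectl run; the usual replacement (an assumption here, not from the original post) is kubectl create deployment, which does accept --replicas:

```shell
# kubectl run no longer takes --replicas; create a Deployment instead
kubectl create deployment nvidia --image=jitteam/devicequery --replicas=2

# Check that both pods were scheduled (ideally one per worker, if resources allow)
kubectl get pods -o wide
```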

Thank you

Hi,

Which version do you use?
Please note that --replicas was added in version 1.19:

https://v1-19.docs.kubernetes.io/docs/setup/release/notes/

Thanks.

Hi,
It worked with -- replicas (with the space).
Should I be able to query the resources of the whole cluster from the master node, or is that something that the Kubernetes layer manages?
I mean, with deviceQuery on the master, should I get 384 CUDA cores and 8 GB of RAM, or only 128 cores and 4 GB (or 2 GB)?
Thank you

Hi,

You should get 3 nodes and each node has its own resources.
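To make the per-node point concrete, here is a sketch (assuming a standard kubectl setup) of how to list what each node advertises; the totals are reported per node, never summed across the cluster:

```shell
# Show each node's allocatable CPU and memory side by side
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory

# Full per-node detail, including any extended resources such as GPUs
kubectl describe nodes | grep -A 6 "Allocatable"
```

This is also why deviceQuery inside any single pod only ever reports the 128 cores of the Nano it is running on: pods are scheduled onto one node, and Kubernetes does not aggregate GPU memory or cores across nodes.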
Thanks.

Hi,

So, it is not working for me: I only get one node, with the resources of that one node.

Thank you

Thanks for your patience.

We are checking this internally.
Will share more information with you later.

Thanks.

Hi,

Sorry for the late update.

Does the discussion shared above (topic149474) help with your issue?
Do you still need our support?

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.