Kubernetes Jetson Cluster and TensorFlow not recognizing all GPU memory

Hi,

I am building a cluster based on Jetsons, currently three Nanos (one 4GB + two 2GB), but soon I will add a few TX1/TX2.

I have set up the Kubernetes cluster, as you can see here (the 4GB Nano is the master):

dora@dora-desktop:~$ kubectl get nodes
NAME            STATUS   ROLES                  AGE   VERSION
dora-desktop    Ready    control-plane,master   31h   v1.23.1
dora1-desktop   Ready    worker                 30h   v1.23.1
dora2-desktop   Ready    worker                 30h   v1.23.1

I have pulled the Docker container l4t-tensorflow:r32.6.1-tf2.5-py3.
I run the container with docker run and then try to get the number of GPUs / memory with Python and TensorFlow inside it.
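The run command is something like this (a sketch of my invocation; the --runtime nvidia flag is what exposes the Jetson GPU to the container):

sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3

Inside that container I get: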

root@dora-desktop:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:01:44.841147: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.222173: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:01:50.236451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.236618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:01:50.236772: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.288994: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:01:50.289184: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:01:50.333929: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:01:50.374851: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:01:50.485243: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:01:50.521378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:01:50.523064: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:01:50.523720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

So I get only one GPU with a deviceMemorySize of 3.87GiB, but it should report something like 3 GPUs and around 7.6GiB of memory in total, shouldn't it?

If I instead create a pod with kubectl:

sudo vim l4t-tensorflow.yaml
apiVersion: v1
kind: Pod
metadata:
  name: l4t-tensorflow
spec:
  containers:
    - name: nvidia
      image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
      command: [ "sleep" ]
      args: [ "1d" ]

kubectl apply -f l4t-tensorflow.yaml
kubectl exec -it l4t-tensorflow -- /bin/bash

I get:

root@l4t-tensorflow:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:33:27.803933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:36.898995: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:33:36.964329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:36.964530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 1.93GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:33:36.964612: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:37.118609: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:33:37.119229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:33:37.208444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:33:37.317166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:33:37.455693: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:33:37.543625: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:33:37.547713: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:33:37.548301: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.548901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.549240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

So with the pod I get the resources of a worker and with docker I get the resources of the master.

If I instead create a deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: l4t-tensorflow
spec:
  selector:
    matchLabels:
      app: cluster
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: cluster
    spec:
      containers:
        - name: nvidia
          image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3

I don’t know how to open a terminal for the deployment…
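I guess I could list the deployment's pods by the app: cluster label and exec into one of them, something like this (untested sketch; <pod-name> would be one of the names returned by the first command):

kubectl get pods -l app=cluster
kubectl exec -it <pod-name> -- /bin/bash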

How can I set it to use all the resources of the cluster?

Thank you,
Best regards

Hi,

It seems that Kubernetes cannot find the other two 2GB Nanos.

We have a discussion about enabling the GPUs for the Nano cluster.
Would you mind giving it a check to see if it helps?
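For context, enabling the GPUs usually means running the NVIDIA device plugin on each node so that the GPU is advertised as a schedulable resource. A pod would then request it roughly like this (a sketch, assuming the plugin exposes the resource as nvidia.com/gpu):

apiVersion: v1
kind: Pod
metadata:
  name: l4t-tensorflow-gpu
spec:
  containers:
    - name: nvidia
      image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
      command: [ "sleep" ]
      args: [ "1d" ]
      resources:
        limits:
          nvidia.com/gpu: 1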

Thanks.

Hi,

I checked that, but without success.

Thank you

Hi,
As suggested in that post, with kubectl run -i -t nvidia --image=jitteam/devicequery, only the 4GB Nano is recognised.

But if I try kubectl run nvidia --image=jitteam/devicequery --replicas=2, I get that --replicas is not supported:

dora@dora-desktop:~$ kubectl run nvidia --image=jitteam/devicequery --replicas=2
error: unknown flag: --replicas
See 'kubectl run --help' for usage.

Thank you

Hi,

Which version do you use?
Please note that the --replicas flag was added in version 1.9.

Thanks.

Hi,
It worked with -- replicas (with the space).
Should I be able to query the resources of the whole cluster from the master node, or is that something that the Kubernetes layer manages?
I mean, with deviceQuery on the master, should I get 384 CUDA cores and 8GB of RAM, or only 128 cores and 4GB (or 2GB)?
Thank you

Hi,

You should get 3 nodes and each node has its own resources.
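You can check what each node advertises with standard kubectl, for example:

kubectl describe nodes | grep -A 8 Allocatable

Each pod is scheduled onto a single node and only sees that node's GPU and memory; the cluster does not aggregate them into one large device.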
Thanks.

Hi,

So it is not working for me: I only get one node, with the resources of a single node.

Thank you

Thanks for your patience.

We are checking this internally.
Will share more information with you later.

Thanks.