Hi,
I am building a cluster based on Jetsons, currently three Nanos (one 4GB + two 2GB), and soon I will add a few TX1/TX2 boards.
I have set up the Kubernetes cluster, as you can see here (the 4GB Nano is the master):
dora@dora-desktop:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
dora-desktop Ready control-plane,master 31h v1.23.1
dora1-desktop Ready worker 30h v1.23.1
dora2-desktop Ready worker 30h v1.23.1
I have pulled the Docker container l4t-tensorflow:r32.6.1-tf2.5-py3.
I run the container with docker run, and when I try to get the number of GPUs / the amount of memory with Python and TensorFlow, I get:
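(For reference, the command I used was roughly the following, following NVIDIA's instructions for the L4T containers; I may be misremembering some flags:)

```shell
# Run the L4T TensorFlow container with the NVIDIA runtime so the GPU is visible
sudo docker run -it --rm --runtime nvidia \
    nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
```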
root@dora-desktop:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:01:44.841147: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.222173: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:01:50.236451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.236618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:01:50.236772: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:01:50.288994: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:01:50.289184: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:01:50.333929: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:01:50.374851: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:01:50.485243: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:01:50.521378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:01:50.523064: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:01:50.523720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:01:50.524825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
So I get only one GPU and a deviceMemorySize of 3.87 GiB, but shouldn't it report something like 3 GPUs and ~7.6 GiB of memory?
If I create a pod with kubectl:
sudo vim l4t-tensorflow.yaml
apiVersion: v1
kind: Pod
metadata:
  name: l4t-tensorflow
spec:
  containers:
  - name: nvidia
    image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
    command: [ "sleep" ]
    args: [ "1d" ]
kubectl apply -f l4t-tensorflow.yaml
kubectl exec -it l4t-tensorflow -- /bin/bash
I get:
root@l4t-tensorflow:/# python3 -c "import tensorflow as tf;physical_devices = tf.config.list_physical_devices();print(physical_devices)"
2022-01-20 20:33:27.803933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:36.898995: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-20 20:33:36.964329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:36.964530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 1.93GiB deviceMemoryBandwidth: 194.55MiB/s
2022-01-20 20:33:36.964612: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2022-01-20 20:33:37.118609: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-20 20:33:37.119229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-20 20:33:37.208444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-20 20:33:37.317166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-20 20:33:37.455693: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-20 20:33:37.543625: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-20 20:33:37.547713: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-20 20:33:37.548301: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.548901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2022-01-20 20:33:37.549240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
So with the pod I get the resources of a worker node, and with plain docker I get the resources of the master.
If I create a deployment with kubectl:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: l4t-tensorflow
spec:
  selector:
    matchLabels:
      app: cluster
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: cluster
    spec:
      containers:
      - name: nvidia
        image: nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3
I don’t know how to open a terminal in one of the deployment’s pods…
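(Guessing from the pod case, I suppose it would be something like the following — listing the pods by the `app: cluster` label from the manifest above and exec'ing into one of them — but I am not sure this is the right approach:)

```shell
# List the pods created by the deployment (they carry the app: cluster label)
kubectl get pods -l app=cluster
# Open a shell in one of them, substituting the actual generated pod name
kubectl exec -it <pod-name> -- /bin/bash
```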
How can I set it up to use all the resources of the cluster?
Thank you,
Best regards