Why can't I train with GPU after installing tensorflow?

strawberryluo · January 16, 2024, 3:28am

I installed TensorFlow using the following commands:
sudo apt-get install libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev liblapack-dev libblas-dev gfortran
sudo apt-get install python3-pip
sudo python3 -m pip install --upgrade pip
sudo pip3 install -U testresources setuptools==65.5.0
sudo pip3 install -U numpy==1.22 future==0.18.2 mock==3.0.5 keras_preprocessing==1.1.2 keras_applications==1.0.8 gast==0.4.0 protobuf pybind11 cython pkgconfig packaging h5py==3.7.0
sudo pip3 install --extra-index-url Index of /compute/redist/jp/v512 tensorflow==2.12.0+nv23.06

The command
sudo python3 -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))"
returns true.

When I start training my model, it prints this error:
E tensorflow/core/grappler/optimizers/meta_optimizer.cc:1014] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape ingestureCNN/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
Training still proceeds, but GPU utilization stays at basically 0 even though GPU memory usage goes up, so I'm not sure whether the training process is actually using the GPU.
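On Jetson boards, GPU load can be read from the tegrastats tool, which reports it in the GR3D_FREQ field. The sketch below pulls that percentage out of one line of tegrastats output so it can be logged while training runs. The helper name and the sample line are illustrative assumptions, and the exact line layout can differ between JetPack releases.

```python
import re


def gr3d_utilization(tegrastats_line):
    """Return the GPU (GR3D) load in percent from one tegrastats line, or None."""
    match = re.search(r"GR3D_FREQ (\d+)%", tegrastats_line)
    return int(match.group(1)) if match else None


# Illustrative line only (assumption), not captured from the poster's board.
sample = "RAM 3012/7620MB SWAP 0/3810MB CPU [12%@1420,8%@1420] EMC_FREQ 4% GR3D_FREQ 87%"
print(gr3d_utilization(sample))
```

A value that stays near 0 across many samples while an epoch is running would suggest the work is not reaching the GPU.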

Here is the information about the version of the software that I am using:
aarch64
Jetpack 5.1.1
Ubuntu 20.04
CUDA 11.4
cuDNN 8.6

AastaLLL · January 16, 2024, 6:24am

Hi,

This looks like a known issue in TensorFlow.
Could you check the suggestion below to see if it helps?

github.com/tensorflow/tensorflow

E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: size of values 0 does not match size of permutation 4.

opened 10:55PM - 21 Nov 19 UTC

closed 06:37AM - 12 Dec 19 UTC

wkdgnsgo

stat:awaiting response type:bug comp:grappler comp:ops TF 2.0

<em>Please make sure that this is a bug. As per our [GitHub Policy](https://gith…ub.com/tensorflow/tensorflow/blob/master/ISSUES.md), we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template</em>

**System information**
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): 2.0
- Python version: 3.7
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory: RTX 2080Ti 11GB

You can collect some of this information using our environment capture [script](https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh)
You can also obtain the TensorFlow version with:
1. TF 1.0: `python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"`
2. TF 2.0: `python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"`

**Describe the current behavior**
Originally, I built this model in tensorflow 1.1x and I transferred the model to TF 2.0 manually to use tf.keras. It is working but it shows me this error message (E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: size of values 0 does not match size of permutation 4.) and its performance is worse than tf 1.1x. I suspect that this error interrupts to train somehow. I didn't put any permutation layer in my model. It is hard to find it.

**Describe the expected behavior**

**Code to reproduce the issue**
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

**Other info / logs**
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

Thanks.
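The error message names Grappler's layout optimizer (the TransposeNHWCToNCHW-LayoutOptimizer suffix). One way to test whether that pass is involved is to switch it off before the model is built. This is a minimal sketch under that assumption; neither this thread nor the quoted issue text confirms it as a fix, and the function name is hypothetical.

```python
def disable_layout_optimizer():
    """Turn off Grappler's layout optimizer and return the active options.

    Returns None when TensorFlow is not importable in this interpreter.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # Call this before building or training the model so that graphs
    # traced afterwards are optimized without the layout pass.
    tf.config.optimizer.set_experimental_options({"layout_optimizer": False})
    return tf.config.optimizer.get_experimental_options()


if __name__ == "__main__":
    print(disable_layout_optimizer())
```

If the message disappears but GPU utilization stays near 0, the error and the missing GPU load are probably separate problems.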

strawberryluo · January 16, 2024, 7:29am

But I don't use tf.where. Here is my code:

# -*- coding: UTF-8 -*-

import os

# Select the GPU to use (set before TensorFlow is imported so it takes effect)
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import gesture_model as gm

# Path for saving the weights
checkpoint_path = "weights/gestureCNN_16_50/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model weights every 5 epochs
# (`period` is deprecated in favor of `save_freq`; kept here as in the original)
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    period=5)

sh = 96

mymodel = gm.gestureCNN(input_shape=(sh, sh, 3), num_classes=3)
mymodel.summary()
parallel_model = mymodel

epochs = 100
batch_size = 32

train_datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
    validation_split=0.2)

validation_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    'gesture_dataset/gesture_train',
    target_size=(sh, sh),
    batch_size=batch_size,
    class_mode='categorical',  # or binary
    subset='training')

validation_generator = train_datagen.flow_from_directory(
    'gesture_dataset/gesture_train',
    target_size=(sh, sh),
    batch_size=batch_size,
    class_mode='categorical',  # or binary
    subset='validation')

# Compile the network
parallel_model.compile(optimizer='adadelta',
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])

# Model.fit accepts generators directly; fit_generator is deprecated in TF 2.x
parallel_model.fit(train_generator,
                   validation_data=validation_generator,
                   steps_per_epoch=int(592 / batch_size),
                   validation_steps=batch_size,
                   epochs=epochs,
                   callbacks=[cp_callback])

AastaLLL · January 17, 2024, 6:31am

Hi,

Could you run a simple testing model to see if the training can run on GPU?
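A simple test of this kind could look like the sketch below. It trains a tiny CNN on random data shaped like the input in this thread (96x96 RGB, 3 classes) and reports how many GPUs TensorFlow sees and on which device a small probe operation was placed. The model, the data, and the function name are stand-ins chosen for illustration, not the poster's gestureCNN.

```python
def gpu_smoke_test(epochs=1):
    """Train a tiny CNN on random data and report where TensorFlow ran.

    Returns None when TensorFlow is not importable in this interpreter.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    gpus = tf.config.list_physical_devices("GPU")

    # Random stand-in data: 64 images of 96x96x3 with one-hot labels, 3 classes.
    images = tf.random.uniform((64, 96, 96, 3))
    labels = tf.one_hot(
        tf.random.uniform((64,), maxval=3, dtype=tf.int32), depth=3)

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu",
                               input_shape=(96, 96, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adadelta", loss="categorical_crossentropy")
    history = model.fit(images, labels, batch_size=32,
                        epochs=epochs, verbose=0)

    # The device string of an eager tensor shows where the op was placed.
    probe = tf.matmul(tf.ones((2, 2)), tf.ones((2, 2)))
    return {
        "num_gpus": len(gpus),
        "probe_device": probe.device,
        "final_loss": float(history.history["loss"][-1]),
    }


if __name__ == "__main__":
    print(gpu_smoke_test())
```

If `probe_device` ends in GPU:0 and GPU load rises while this runs, the TensorFlow installation itself can use the GPU, which narrows the problem down to the custom model or its input pipeline.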

If the GPU goes unused only with your custom model, the issue is most likely in the TensorFlow implementation.
In that case, we recommend checking with the TensorFlow team to get better help.