Description
I have a Lambda workstation and access to my university’s high-performance computing cluster (HPRC). On both machines I ran the code from the tutorial Custom training with tf.distribute.Strategy | TensorFlow Core (copied verbatim), as well as my own research code and some very basic dummy code from ChatGPT. All three cases produced the same error on both systems, and I am lost. I have worked on this for several days; either I am overlooking something incredibly simple, or the problem is something deeper.
Environment
TensorRT Version:
GPU Type: 2x NVIDIA Titan RTX (TU102) – Lambda, 2x NVIDIA T4 – HPRC
Nvidia Driver Version: 535.216.01
CUDA Version: 12.2
CUDNN Version:
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): 3.10.8
TensorFlow Version (if applicable): 2.18.0
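For completeness, the TensorFlow, CUDA/cuDNN build info, and GPU visibility reported above can be re-checked from inside the virtual environment with a short snippet along these lines (a sanity check only, not part of the failing script):

# Environment sanity check (not part of the failing script).
import tensorflow as tf

print(tf.__version__)                                  # 2.18.0 in my case
print(tf.sysconfig.get_build_info()["cuda_version"])   # CUDA version this TF build was compiled against
print(tf.sysconfig.get_build_info()["cudnn_version"])  # cuDNN version this TF build was compiled against
print(tf.config.list_physical_devices("GPU"))          # should list both GPUs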
Relevant Files
OUTPUT (STDOUT):
Total CPUs available on node: 10
CPUs allocated to this job: 10
Number of tasks: 1
Number of CPUs per task: 10
Number of GPUs:
Sat Jan 25 01:30:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:1F:00.0 Off | Off |
| N/A 26C P8 8W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:20:00.0 Off | Off |
| N/A 25C P8 9W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
2.18.0
Training data shape: (60000, 28, 28)
Testing data shape: (10000, 28, 28)
Number of classes: 10
Number of devices: 2
OUTPUT (STDERR) (same for Lambda workstation and HPRC):
2025-01-25 01:30:49.918291: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-25 01:30:49.938318: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1737790249.954884 1055788 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1737790249.960092 1055788 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-25 01:30:49.978543: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0000 00:00:1737790264.582581 1055788 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14793 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:1f:00.0, compute capability: 7.5
I0000 00:00:1737790264.584774 1055788 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14793 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:20:00.0, compute capability: 7.5
2025-01-25 01:31:11.324458: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{function_node cond_false_2736}}{{node cond/Placeholder_1}}]]
2025-01-25 01:31:11.347563: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 5625148062850072845
2025-01-25 01:31:11.347585: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{function_node cond_false_2736}}{{node cond/Placeholder_1}}]]
[[cond/else/_1/cond/StatefulPartitionedCall/sequential/dense/bias/replica_1/Initializer/ReadVariableOp/_276]]
[... the same "Local rendezvous recv item cancelled" message from local_rendezvous.cc:424 repeats ~60 more times with different key hashes, omitted for brevity ...]
Traceback (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
    total_loss += distributed_train_step(x)
  File "/scratch/user/apc5/tf-gpu/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/scratch/user/apc5/tf-gpu/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node cond/Placeholder_1 defined at (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
Detected at node cond/Placeholder_1 defined at (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
Detected at node cond/Placeholder_1 defined at (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
3 root error(s) found.
(0) INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{node cond/Placeholder_1}}]]
[[cond/else/_1/cond/StatefulPartitionedCall/sequential/dense/bias/replica_1/Initializer/ReadVariableOp/_276]]
[[cond/then/_0/cond/StatefulPartitionedCall/add/_306]]
(1) INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{node cond/Placeholder_1}}]]
[[cond/else/_1/cond/StatefulPartitionedCall/sequential/dense/bias/replica_1/Initializer/ReadVariableOp/_276]]
(2) INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{node cond/Placeholder_1}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_fn_with_cond_4275]
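For what it’s worth, the failure appears to be raised on the first call to distributed_train_step (line 159 of dist_train.py), and the error references replica_1 variable initializers. A stripped-down check of strategy.run inside a tf.function, with no model or data pipeline at all, might help narrow down whether MirroredStrategy itself triggers the placeholder error on these setups; the snippet below is only a hypothetical isolation sketch (names invented for illustration), not code from my actual runs:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    v = tf.Variable(1.0)  # one mirrored variable, nothing else

@tf.function
def trivial_step():
    def replica_fn():
        return v + 1.0
    per_replica = strategy.run(replica_fn)
    # Sum the per-replica results, mirroring what distributed_train_step does.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

print(trivial_step())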
CODE:
# Import TensorFlow
import tensorflow as tf
# Helper libraries
import numpy as np
import os
print(tf.__version__)
# --- START HPRC ---
# def load_local_fashion_mnist(data_dir):
#     """Load Fashion MNIST dataset from local files."""
#     def load_images(file_path):
#         with open(file_path, 'rb') as f:
#             _ = f.read(16)  # Skip header
#             data = np.frombuffer(f.read(), dtype=np.uint8)
#             return data.reshape(-1, 28, 28)
#     def load_labels(file_path):
#         with open(file_path, 'rb') as f:
#             _ = f.read(8)  # Skip header
#             data = np.frombuffer(f.read(), dtype=np.uint8)
#             return data
#     train_images = load_images(os.path.join(data_dir, 'train-images-idx3-ubyte'))
#     train_labels = load_labels(os.path.join(data_dir, 'train-labels-idx1-ubyte'))
#     test_images = load_images(os.path.join(data_dir, 't10k-images-idx3-ubyte'))
#     test_labels = load_labels(os.path.join(data_dir, 't10k-labels-idx1-ubyte'))
#     return (train_images, train_labels), (test_images, test_labels)
# # Path to the directory containing the dataset files
# data_directory = "/scratch/user/apc5/fashion/"
# # Load the dataset
# (train_images, train_labels), (test_images, test_labels) = load_local_fashion_mnist(data_directory)
# # Verify dataset loading
# print(f"Training data shape: {train_images.shape}")
# print(f"Testing data shape: {test_images.shape}")
# print(f"Number of classes: {len(set(train_labels))}")
# --- END HPRC ---
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
# Add a dimension to the array -> new shape == (28, 28, 1)
# This is done because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]
# Scale the images to the [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)
# If the list of devices is not specified in
# `tf.distribute.MirroredStrategy` constructor, they will be auto-detected.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
BUFFER_SIZE = len(train_images)
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
EPOCHS = 10
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)
def create_model():
    regularizer = tf.keras.regularizers.L2(1e-5)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3,
                               activation='relu',
                               kernel_regularizer=regularizer),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3,
                               activation='relu',
                               kernel_regularizer=regularizer),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64,
                              activation='relu',
                              kernel_regularizer=regularizer),
        tf.keras.layers.Dense(10, kernel_regularizer=regularizer)
    ])
    return model
with strategy.scope():
    # Set reduction to `NONE` so you can do the reduction yourself.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)

    def compute_loss(labels, predictions, model_losses):
        per_example_loss = loss_object(labels, predictions)
        loss = tf.nn.compute_average_loss(per_example_loss)
        if model_losses:
            loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
        return loss
with strategy.scope():
    test_loss = tf.keras.metrics.Mean(name='test_loss')
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy')
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='test_accuracy')
# A model, an optimizer, and a checkpoint must be created under `strategy.scope`.
with strategy.scope():
    model = create_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

def train_step(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = compute_loss(labels, predictions, model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy.update_state(labels, predictions)
    return loss

def test_step(inputs):
    images, labels = inputs
    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)
    test_loss.update_state(t_loss)
    test_accuracy.update_state(labels, predictions)
# `run` replicates the provided computation and runs it
# with the distributed input.
@tf.function
def distributed_train_step(dataset_inputs):
    per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                           axis=None)

@tf.function
def distributed_test_step(dataset_inputs):
    return strategy.run(test_step, args=(dataset_inputs,))
for epoch in range(EPOCHS):
    # TRAIN LOOP
    total_loss = 0.0
    num_batches = 0
    for x in train_dist_dataset:
        total_loss += distributed_train_step(x)
        num_batches += 1
    train_loss = total_loss / num_batches

    # TEST LOOP
    for x in test_dist_dataset:
        distributed_test_step(x)

    template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
                "Test Accuracy: {}")
    print(template.format(epoch + 1, train_loss,
                          train_accuracy.result() * 100, test_loss.result(),
                          test_accuracy.result() * 100))

    test_loss.reset_states()
    train_accuracy.reset_states()
    test_accuracy.reset_states()
Steps To Reproduce
- module load GCCcore/12.2.0 Python/3.10.8 # ← Can be ignored if not running on HPRC
- python3 -m venv --system-site-packages ./tf-gpu
- source tf-gpu/bin/activate
- python3 -m pip install tensorflow[and-cuda]
- python3 test_code.py
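Optional, not part of my original steps: before launching test_code.py, a quick check like the following (run inside the activated tf-gpu environment) confirms that the freshly installed tensorflow[and-cuda] build can actually see both GPUs.

# Optional post-install sanity check (my addition, not part of the original run).
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # expect two PhysicalDevice entries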