Description
I have a Lambda workstation and access to my university’s high-performance computing cluster (HPRC). On both machines I ran the code from the tutorial Custom training with tf.distribute.Strategy | TensorFlow Core (copied verbatim), as well as my own research code and some very basic dummy code from ChatGPT. All three cases produced the same error on both systems, and I am lost. I have worked on this for several days; either I am overlooking something incredibly simple, or the problem is something deeper.
Environment
TensorRT Version:
GPU Type: 2x NVIDIA Titan RTX (TU102) – Lambda, 2x NVIDIA T4 – HPRC
Nvidia Driver Version: 535.216.01
CUDA Version: 12.2
CUDNN Version:
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): 3.10.8
TensorFlow Version (if applicable): 2.18.0
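For completeness, the TensorFlow, CUDA/cuDNN build info, and GPU visibility reported above can be re-checked from inside the virtual environment with a short snippet along these lines (a sanity check only, not part of the failing script):

# Environment sanity check (not part of the failing script).
import tensorflow as tf

print(tf.__version__)                                  # 2.18.0 in my case
print(tf.sysconfig.get_build_info()["cuda_version"])   # CUDA version this TF build was compiled against
print(tf.sysconfig.get_build_info()["cudnn_version"])  # cuDNN version this TF build was compiled against
print(tf.config.list_physical_devices("GPU"))          # should list both GPUs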
Relevant Files
OUTPUT (STDOUT):
Total CPUs available on node: 10
CPUs allocated to this job: 10
Number of tasks: 1
Number of CPUs per task: 10
Number of GPUs:
Sat Jan 25 01:30:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:1F:00.0 Off | Off |
| N/A 26C P8 8W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:20:00.0 Off | Off |
| N/A 25C P8 9W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
2.18.0
Training data shape: (60000, 28, 28)
Testing data shape: (10000, 28, 28)
Number of classes: 10
Number of devices: 2
OUTPUT (STDERR) (same for Lambda workstation and HPRC):
2025-01-25 01:30:49.918291: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-25 01:30:49.938318: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1737790249.954884 1055788 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1737790249.960092 1055788 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-25 01:30:49.978543: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0000 00:00:1737790264.582581 1055788 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14793 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:1f:00.0, compute capability: 7.5
I0000 00:00:1737790264.584774 1055788 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14793 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:20:00.0, compute capability: 7.5
2025-01-25 01:31:11.324458: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{function_node cond_false_2736}}{{node cond/Placeholder_1}}]]
2025-01-25 01:31:11.347563: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 5625148062850072845
2025-01-25 01:31:11.347585: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{function_node cond_false_2736}}{{node cond/Placeholder_1}}]]
[[cond/else/_1/cond/StatefulPartitionedCall/sequential/dense/bias/replica_1/Initializer/ReadVariableOp/_276]]
[... the same "Local rendezvous recv item cancelled" message from local_rendezvous.cc:424 repeats ~60 more times with different key hashes, omitted for brevity ...]
Traceback (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
    total_loss += distributed_train_step(x)
  File "/scratch/user/apc5/tf-gpu/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/scratch/user/apc5/tf-gpu/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node cond/Placeholder_1 defined at (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
Detected at node cond/Placeholder_1 defined at (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
Detected at node cond/Placeholder_1 defined at (most recent call last):
  File "/home/apc5/SMA_NBO/dist_train.py", line 159, in <module>
3 root error(s) found.
(0) INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{node cond/Placeholder_1}}]]
[[cond/else/_1/cond/StatefulPartitionedCall/sequential/dense/bias/replica_1/Initializer/ReadVariableOp/_276]]
[[cond/then/_0/cond/StatefulPartitionedCall/add/_306]]
(1) INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{node cond/Placeholder_1}}]]
[[cond/else/_1/cond/StatefulPartitionedCall/sequential/dense/bias/replica_1/Initializer/ReadVariableOp/_276]]
(2) INVALID_ARGUMENT: You must feed a value for placeholder tensor 'cond/else/_1/cond/Placeholder_1' with dtype int32
[[{{node cond/Placeholder_1}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_fn_with_cond_4275]
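For what it’s worth, the failure appears to be raised on the first call to distributed_train_step (line 159 of dist_train.py), and the error references replica_1 variable initializers. A stripped-down check of strategy.run inside a tf.function, with no model or data pipeline at all, might help narrow down whether MirroredStrategy itself triggers the placeholder error on these setups; the snippet below is only a hypothetical isolation sketch (names invented for illustration), not code from my actual runs:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    v = tf.Variable(1.0)  # one mirrored variable, nothing else

@tf.function
def trivial_step():
    def replica_fn():
        return v + 1.0
    per_replica = strategy.run(replica_fn)
    # Sum the per-replica results, mirroring what distributed_train_step does.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

print(trivial_step())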
CODE:
# Import TensorFlow
import tensorflow as tf
# Helper libraries
import numpy as np
import os
print(tf.__version__)
# --- START HPRC ---
# def load_local_fashion_mnist(data_dir):
#     """Load Fashion MNIST dataset from local files."""
#     def load_images(file_path):
#         with open(file_path, 'rb') as f:
#             _ = f.read(16)  # Skip header
#             data = np.frombuffer(f.read(), dtype=np.uint8)
#             return data.reshape(-1, 28, 28)
#     def load_labels(file_path):
#         with open(file_path, 'rb') as f:
#             _ = f.read(8)  # Skip header
#             data = np.frombuffer(f.read(), dtype=np.uint8)
#             return data
#     train_images = load_images(os.path.join(data_dir, 'train-images-idx3-ubyte'))
#     train_labels = load_labels(os.path.join(data_dir, 'train-labels-idx1-ubyte'))
#     test_images = load_images(os.path.join(data_dir, 't10k-images-idx3-ubyte'))
#     test_labels = load_labels(os.path.join(data_dir, 't10k-labels-idx1-ubyte'))
#     return (train_images, train_labels), (test_images, test_labels)
# # Path to the directory containing the dataset files
# data_directory = "/scratch/user/apc5/fashion/"
# # Load the dataset
# (train_images, train_labels), (test_images, test_labels) = load_local_fashion_mnist(data_directory)
# # Verify dataset loading
# print(f"Training data shape: {train_images.shape}")
# print(f"Testing data shape: {test_images.shape}")
# print(f"Number of classes: {len(set(train_labels))}")
# --- END HPRC ---
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
# Add a dimension to the array -> new shape == (28, 28, 1)
# This is done because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]
# Scale the images to the [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)
# If the list of devices is not specified in
# `tf.distribute.MirroredStrategy` constructor, they will be auto-detected.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
BUFFER_SIZE = len(train_images)
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
EPOCHS = 10
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)
def create_model():
    regularizer = tf.keras.regularizers.L2(1e-5)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3,
                               activation='relu',
                               kernel_regularizer=regularizer),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3,
                               activation='relu',
                               kernel_regularizer=regularizer),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64,
                              activation='relu',
                              kernel_regularizer=regularizer),
        tf.keras.layers.Dense(10, kernel_regularizer=regularizer)
    ])
    return model
with strategy.scope():
    # Set reduction to `NONE` so you can do the reduction yourself.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)

    def compute_loss(labels, predictions, model_losses):
        per_example_loss = loss_object(labels, predictions)
        loss = tf.nn.compute_average_loss(per_example_loss)
        if model_losses:
            loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
        return loss
with strategy.scope():
    test_loss = tf.keras.metrics.Mean(name='test_loss')
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy')
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='test_accuracy')
# A model, an optimizer, and a checkpoint must be created under `strategy.scope`.
with strategy.scope():
    model = create_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

def train_step(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = compute_loss(labels, predictions, model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy.update_state(labels, predictions)
    return loss

def test_step(inputs):
    images, labels = inputs
    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)
    test_loss.update_state(t_loss)
    test_accuracy.update_state(labels, predictions)
# `run` replicates the provided computation and runs it
# with the distributed input.
@tf.function
def distributed_train_step(dataset_inputs):
    per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                           axis=None)

@tf.function
def distributed_test_step(dataset_inputs):
    return strategy.run(test_step, args=(dataset_inputs,))
for epoch in range(EPOCHS):
    # TRAIN LOOP
    total_loss = 0.0
    num_batches = 0
    for x in train_dist_dataset:
        total_loss += distributed_train_step(x)
        num_batches += 1
    train_loss = total_loss / num_batches

    # TEST LOOP
    for x in test_dist_dataset:
        distributed_test_step(x)

    template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
                "Test Accuracy: {}")
    print(template.format(epoch + 1, train_loss,
                          train_accuracy.result() * 100, test_loss.result(),
                          test_accuracy.result() * 100))

    test_loss.reset_states()
    train_accuracy.reset_states()
    test_accuracy.reset_states()
Steps To Reproduce
- module load GCCcore/12.2.0 Python/3.10.8 # ← Can be ignored if not running on HPRC
- python3 -m venv --system-site-packages ./tf-gpu
- source tf-gpu/bin/activate
- python3 -m pip install tensorflow[and-cuda]
- python3 test_code.py
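Optional, not part of my original steps: before launching test_code.py, a quick check like the following (run inside the activated tf-gpu environment) confirms that the freshly installed tensorflow[and-cuda] build can actually see both GPUs.

# Optional post-install sanity check (my addition, not part of the original run).
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # expect two PhysicalDevice entries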