Multiple execution warnings after switching TensorFlow from 2.16.1 CPU to the v60dp tensorflow==2.15.0+nv24.03 GPU build

I switched TensorFlow from the 2.16.1 CPU build to the v60dp tensorflow==2.15.0+nv24.03 GPU build.

Now I get the warnings below. Does anyone know why? Is there any way to fix them? (See the note after the logs at the end of this post.)

  • issue 1: Unable to register cuDNN factory
2024-04-29 10:24:41.927931: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-29 10:24:41.928049: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-29 10:24:41.936001: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  • issue 2: could not open file to read NUMA node
2024-04-29 10:24:52.248622: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.369514: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.369787: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.371141: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.371276: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.371371: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.591879: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.592299: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.592376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2019] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-04-29 10:24:52.592525: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-04-29 10:24:52.592619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1815 MB memory:  -> device: 0, name: Orin, pci bus id: 0000:00:00.0, compute capability: 8.7
2024-04-29 10:24:54.802741: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
  • issue 3: ran out of memory
2024-04-29 10:25:06.717116: I external/local_xla/xla/service/service.cc:168] XLA service 0xaaab20b03270 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-29 10:25:06.717227: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Orin, Compute Capability 8.7
2024-04-29 10:25:06.865925: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-29 10:25:07.832752: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:467] Loaded cuDNN version 8904
2024-04-29 10:25:16.422695: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:16.824818: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:17.273160: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:17.273278: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:18.320255: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:18.568112: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:18.966801: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:19.409092: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:19.409221: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:25:20.455916: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-29 10:26:59.030551: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng1{k2=7,k3=0} for conv (f32[512,512,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,512,28,28]{3,2,1,0}, f32[32,512,28,28]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0} is taking a while...
2024-04-29 10:26:59.116775: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.086371749s
Trying algorithm eng1{k2=7,k3=0} for conv (f32[512,512,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,512,28,28]{3,2,1,0}, f32[32,512,28,28]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0} is taking a while...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1714357630.751689   67799 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
W0000 00:00:1714357631.213201   67799 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 3s/step - accuracy: 0.0186 - loss: 7.0540   
W0000 00:00:1714357734.120590   67798 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
38/38 ━━━━━━━━━━━━━━━━━━━━ 243s 3s/step - accuracy: 0.0186 - loss: 6.9924 - val_accuracy: 0.0136 - val_loss: 3.7653
Epoch 2/101
38/38 ━━━━━━━━━━━━━━━━━━━━ 54s 833ms/step - accuracy: 0.0110 - loss: 3.7623 - val_accuracy: 0.0136 - val_loss: 3.7600
Epoch 3/101
38/38 ━━━━━━━━━━━━━━━━━━━━ 31s 833ms/step - accuracy: 0.0203 - loss: 3.7614 - val_accuracy: 0.0329 - val_loss: 3.7616
Epoch 4/101

Finally, the result is wrong compared with TensorFlow 2.16.1 on CPU.
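For reference, a minimal sketch that quiets most of these startup messages by raising TensorFlow's C++ log threshold before import. This only changes log verbosity; it does not address the out-of-memory warnings or the wrong results.

import os

# Must be set before TensorFlow is imported.
# 1 hides INFO (e.g. the NUMA messages), 2 also hides WARNING (e.g. the allocator
# messages), 3 also hides ERROR (e.g. the "Unable to register ... factory" lines).
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))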

Hi,

Could you share reproducible source code/steps with us so we can check it further?
Please also share the CPU-based script so we can compare.

Thanks.

I have to admit that what I said may not be correct. But I think there must be something wrong here.

The fine-tuning loss and accuracy curves are NOT what they are supposed to be. Both results from the Jetson Orin Nano (2.16.1 CPU and 2.15.0 GPU) are flat lines, and the loss values are quite large. Here is my output: Keras-Fine-Tune-Pre-Trained-Models-GTSRB.zip (5.7 MB)

It should look something like the results in the demo below:

Please check this demo code : learnopencv/Keras-Fine-Tuning-Pre-Trained-Models at master · spmallick/learnopencv · GitHub

EDIT: the following versions do not work at all; I'm currently using v60dp tensorflow==2.15.0+nv24.03.
I hope there is a more stable version without so many errors/warnings. (A sketch of the flag suggested by the errors follows the logs below.)

  • v60dp tensorflow==2.15.0+nv24.02, fit failed
Epoch 1/101
2024-04-30 08:58:50.279287: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 19267584 exceeds 10% of free system memory.
2024-04-30 08:58:50.297470: I external/local_xla/xla/service/service.cc:168] XLA service 0xaaaae3d14e30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-30 08:58:50.297593: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Orin, Compute Capability 8.7
2024-04-30 08:58:50.306346: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 19267584 exceeds 10% of free system memory.
2024-04-30 08:58:50.416977: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-30 08:58:51.336289: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:467] Loaded cuDNN version 8904
2024-04-30 08:58:54.643736: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:54.643880: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:54.643949: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:54.644004: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:56.747669: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:56.747792: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:56.747841: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:56.747886: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 426.38MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:57.381132: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 408.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:57.381293: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 800.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 08:58:57.783661: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:574 : UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv-bias-activation.40 = (f32[32,64,224,224]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,64,224,224]{3,2,1,0} %get-tuple-element.93, f32[64,64,3,3]{3,2,1,0} %transpose.2, f32[64]{0} %arg6.7), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", metadata={op_type="Conv2D" op_name="functional_1_1/vgg16_1/block1_conv2_1/convolution" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1160}, backend_config={"conv_result_scale":1,"activation_mode":"kRelu","side_input_scale":0,"leakyrelu_alpha":0}

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427819008 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
Cell In[25], line 2
      1 # Train the Model.
----> 2 training_results = model_vgg16_finetune.fit(train_dataset,
      3                                             epochs=TrainingConfig.EPOCHS,
      4                                             validation_data=valid_dataset,
      5                                            )

File ~/.local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # `keras.config.disable_traceback_filtering()`
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51 try:
     52   ctx.ensure_initialized()
---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                       inputs, attrs, num_outputs)
     55 except core._NotOkStatusException as e:
     56   if name is not None:

UnknownError: Graph execution error:

Detected at node StatefulPartitionedCall defined at (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/daniel/.local/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/daniel/.local/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 205, in start

  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_95770/756169661.py", line 2, in <module>

  File "/home/daniel/.local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/daniel/.local/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit

  File "/home/daniel/.local/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

Failed to determine best cudnn convolution algorithm for:
%cudnn-conv-bias-activation.40 = (f32[32,64,224,224]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,64,224,224]{3,2,1,0} %get-tuple-element.93, f32[64,64,3,3]{3,2,1,0} %transpose.2, f32[64]{0} %arg6.7), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", metadata={op_type="Conv2D" op_name="functional_1_1/vgg16_1/block1_conv2_1/convolution" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1160}, backend_config={"conv_result_scale":1,"activation_mode":"kRelu","side_input_scale":0,"leakyrelu_alpha":0}

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427819008 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.
	 [[{{node StatefulPartitionedCall}}]] [Op:__inference_one_step_on_iterator_4075]
  • v60dp tensorflow==2.14.0+nv23.11, import error
2024-04-30 09:02:04.798506: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9360] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 09:02:04.798662: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 09:02:04.798895: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1537] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
AttributeError: module 'ml_dtypes' has no attribute 'float8_e4m3b11'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 11
      8 import requests
      9 import glob as glob
---> 11 import tensorflow as tf
     12 from tensorflow import keras
     13 from tensorflow.keras import layers

File /usr/local/lib/python3.10/dist-packages/tensorflow/__init__.py:38
     35 import sys as _sys
     36 import typing as _typing
---> 38 from tensorflow.python.tools import module_util as _module_util
     39 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
     41 # Make sure code inside the TensorFlow codebase can use tf2.enabled() at import.

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/__init__.py:42
     36 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
     38 # pylint: enable=wildcard-import
     39 
     40 # from tensorflow.python import keras
     41 # from tensorflow.python.layers import layers
---> 42 from tensorflow.python.saved_model import saved_model
     43 from tensorflow.python.tpu import api
     45 # Sub-package for performing i/o directly instead of via ops in a graph.

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/saved_model.py:20
     15 """Convenience functions to save a model.
     16 """
     19 # pylint: disable=unused-import
---> 20 from tensorflow.python.saved_model import builder
     21 from tensorflow.python.saved_model import constants
     22 from tensorflow.python.saved_model import loader

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/builder.py:23
     15 """SavedModel builder.
     16 
     17 Builds a SavedModel that can be saved to storage, is language neutral, and
     18 enables systems to produce, consume, or transform TensorFlow Models.
     19 
     20 """
     22 # pylint: disable=unused-import
---> 23 from tensorflow.python.saved_model.builder_impl import _SavedModelBuilder
     24 from tensorflow.python.saved_model.builder_impl import SavedModelBuilder
     25 # pylint: enable=unused-import

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/builder_impl.py:26
     24 from tensorflow.core.protobuf import saved_model_pb2
     25 from tensorflow.core.protobuf import saver_pb2
---> 26 from tensorflow.python.framework import dtypes
     27 from tensorflow.python.framework import ops
     28 from tensorflow.python.framework import tensor

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/dtypes.py:39
     37 _np_bfloat16 = pywrap_ml_dtypes.bfloat16()
     38 _np_float8_e4m3fn = pywrap_ml_dtypes.float8_e4m3fn()
---> 39 _np_float8_e5m2 = pywrap_ml_dtypes.float8_e5m2()
     42 class DTypeMeta(type(_dtypes.DType), abc.ABCMeta):
     43   pass

TypeError: Unable to convert function return value to a Python type! The signature was
	() -> handle

I saw a new version in the repo today, 2.15.0+nv24.04, but it also failed during fit:

2024-04-30 09:09:27.049383: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 12845056 exceeds 10% of free system memory.
2024-04-30 09:09:28.607928: I external/local_xla/xla/service/service.cc:168] XLA service 0xaaaad65be690 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-30 09:09:28.608051: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Orin, Compute Capability 8.7
2024-04-30 09:09:28.737529: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-30 09:09:29.664603: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:467] Loaded cuDNN version 8904
2024-04-30 09:09:31.045111: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 408.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-04-30 09:09:31.065952: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:574 : UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv-bias-activation.39 = (f32[32,64,224,224]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,3,224,224]{3,2,1,0} %add.124, f32[64,3,3,3]{3,2,1,0} %transpose.1, f32[64]{0} %arg4.5), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", metadata={op_type="Conv2D" op_name="functional_1_1/vgg16_1/block1_conv1_1/convolution" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1160}, backend_config={"conv_result_scale":1,"activation_mode":"kRelu","side_input_scale":0,"leakyrelu_alpha":0}

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427819008 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
Cell In[25], line 2
      1 # Train the Model.
----> 2 training_results = model_vgg16_finetune.fit(train_dataset,
      3                                             epochs=TrainingConfig.EPOCHS,
      4                                             validation_data=valid_dataset,
      5                                            )

File ~/.local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # `keras.config.disable_traceback_filtering()`
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File /usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51 try:
     52   ctx.ensure_initialized()
---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                       inputs, attrs, num_outputs)
     55 except core._NotOkStatusException as e:
     56   if name is not None:

UnknownError: Graph execution error:

Detected at node StatefulPartitionedCall defined at (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/daniel/.local/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/daniel/.local/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 205, in start

  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/daniel/.local/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/daniel/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_101349/756169661.py", line 2, in <module>

  File "/home/daniel/.local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/daniel/.local/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit

  File "/home/daniel/.local/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

Failed to determine best cudnn convolution algorithm for:
%cudnn-conv-bias-activation.39 = (f32[32,64,224,224]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,3,224,224]{3,2,1,0} %add.124, f32[64,3,3,3]{3,2,1,0} %transpose.1, f32[64]{0} %arg4.5), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", metadata={op_type="Conv2D" op_name="functional_1_1/vgg16_1/block1_conv1_1/convolution" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1160}, backend_config={"conv_result_scale":1,"activation_mode":"kRelu","side_input_scale":0,"leakyrelu_alpha":0}

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427819008 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.
	 [[{{node StatefulPartitionedCall}}]] [Op:__inference_one_step_on_iterator_4075]
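For reference, a minimal sketch of setting the flag that the error message itself suggests (whether it actually avoids the failure on this board is an assumption; it only relaxes convolution autotuning, it does not add memory):

import os

# Suggested by the error text above: let XLA fall back to a (possibly suboptimal)
# convolution algorithm instead of failing when autotuning runs out of memory.
# Set this before TensorFlow is imported / the GPU is first used.
os.environ["XLA_FLAGS"] = "--xla_gpu_strict_conv_algorithm_picker=false"

import tensorflow as tf

Equivalently, the flag can be exported in the shell before launching the notebook or script.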

Hi,

Which Orin Nano do you use? 8GB or 4GB?
It looks like you tried to run a training job on the Orin Nano, which might not be suitable since it's designed for edge inference.

Basically, all of the output reports an out-of-memory error like the one below:

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427819008 bytes.

So the training does not run correctly.
You can try to lower the batch size to see if it helps.
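For example, a minimal sketch of rebatching the demo's datasets before calling fit; the names train_dataset, valid_dataset, model_vgg16_finetune, and TrainingConfig are taken from the traceback above and are assumptions about the demo code:

import tensorflow as tf

SMALL_BATCH = 8  # try 8 or 16 instead of the demo's 32

# Rebuild the existing (already batched) pipelines with a smaller batch size.
train_dataset = train_dataset.unbatch().batch(SMALL_BATCH).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.unbatch().batch(SMALL_BATCH).prefetch(tf.data.AUTOTUNE)

training_results = model_vgg16_finetune.fit(
    train_dataset,
    epochs=TrainingConfig.EPOCHS,
    validation_data=valid_dataset,
)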

Thanks.

8GB

Well, I thought that since it's a demo, with just 3,216,939 trainable params for the new classifier, it would be an easy job for the Jetson Orin. I was definitely wrong.

No. The demo only works with a batch size of 32; 8/16/24 are not applicable.

Hi,

Jetson is a shared-memory system, so the CPU and GPU use the same 8GB of memory.
TensorFlow is a heavy library, so most of the memory may be used just for loading the library rather than for the training job itself.
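A minimal sketch of two related knobs, assuming they behave the same on Jetson's shared memory as on a discrete GPU: grow GPU allocations on demand, or cap the GPU allocator explicitly. Either must run before the GPU is first used.

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Option 1: allocate GPU memory on demand instead of reserving a large pool up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (alternative, do not combine with option 1): cap the GPU allocator,
    # leaving the rest of the shared 8GB to the CPU side.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],  # MB
    # )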

Thanks.

Thanks. I just ran the program on Colab. It should be a memory issue: the GPU needs about 8.1GB.

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427819008 bytes.
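For what it's worth, a minimal sketch of checking TensorFlow's own GPU allocator usage from inside the script (assuming the device is reported as GPU:0), which can confirm an estimate like the 8.1GB above:

import tensorflow as tf

# Current and peak bytes held by TensorFlow's GPU allocator.
info = tf.config.experimental.get_memory_info("GPU:0")
print(f"current: {info['current'] / 2**20:.1f} MiB, peak: {info['peak'] / 2**20:.1f} MiB")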
