Can 40 Series RTX products use tao-toolkit?

zyann · November 23, 2022, 10:39am

RTX4090, RTX4080 or others.

Morganh · November 24, 2022, 9:10am

The 40 series RTX products are based on new NVIDIA Ada Lovelace architecture.
Current TAO22.05 is not verified on them.
So, not sure the status yet.

zyann · November 24, 2022, 9:23am

Thanks for answering, I tried it on RTX4090 and it doesn’t work now. Hope to support it later.

Morganh · November 24, 2022, 9:25am

Thanks. Could you share some logs?

zyann · November 24, 2022, 10:07am

Information about the mask_rcnn trained with the RTX4090:

2022-11-23 18:11:23,753 [INFO] root: Registry: ['nvcr.io']
2022-11-23 18:11:23,793 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-11-23 18:11:23,798 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/nxin/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
[MaskRCNN] INFO    : Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO    : Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt
[INFO] Log file already exists at /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/status.json
[INFO] Starting MaskRCNN training.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpg5v92omu', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 4
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fcf06a727f0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 01
[MaskRCNN] INFO    : =================================
    
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._prefetch_dataset at 0x7fcf06a74950> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function InputReader.__call__.<locals>._prefetch_dataset at 0x7fcf06a74950>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <function dataset_parser at 0x7fcf123ad730> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function dataset_parser at 0x7fcf123ad730>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
WARNING:tensorflow:Entity <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7fcffa76e9e8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7fcffa76e9e8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7fcff48d1320>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7fcff48d1320>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
WARNING:tensorflow:Entity <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7fcff48d1748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7fcff48d1748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fcff47a7748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fcff47a7748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff4550eb8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff4550eb8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff47b20f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff47b20f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff458b208>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff458b208>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7fcff4513780>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7fcff4513780>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7fcff4550f98>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7fcff4550f98>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fcff4550f60>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fcff4550f60>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff43c1a90>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fcff43c1a90>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7fcff3adfd68>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7fcff3adfd68>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 793.9 GFLOPS/image
WARNING:tensorflow:Entity <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7fcedc091438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7fcedc091438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpg5v92omu/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (307 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
[INFO] 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[47,448,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator gpu_host_bfc
	 [[{{node parser/cond/then/_9/map/TensorArrayV2Stack/TensorListStack}}]]
	 [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[IteratorGetNext/_4707]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[47,448,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator gpu_host_bfc
	 [[{{node parser/cond/then/_9/map/TensorArrayV2Stack/TensorListStack}}]]
	 [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_1138}} OOM when allocating tensor with shape[47,448,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator gpu_host_bfc
	 [[{{node parser/cond/then/_9/map/TensorArrayV2Stack/TensorListStack}}]]
	 [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[IteratorGetNext/_4707]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_1138}} OOM when allocating tensor with shape[47,448,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator gpu_host_bfc
	 [[{{node parser/cond/then/_9/map/TensorArrayV2Stack/TensorListStack}}]]
	 [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 250, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 237, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 88, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 418, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[47,448,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator gpu_host_bfc
	 [[{{node parser/cond/then/_9/map/TensorArrayV2Stack/TensorListStack}}]]
	 [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[IteratorGetNext/_4707]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[47,448,800] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator gpu_host_bfc
	 [[{{node parser/cond/then/_9/map/TensorArrayV2Stack/TensorListStack}}]]
	 [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
2022-11-23 18:52:22,670 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The memory usage is probably up by 2G, but the GPU doesn’t seem to be working.
The same dataset and configuration I trained successfully on the RTX3090, the video memory is occupied at 18G.

Morganh · November 25, 2022, 2:03am

The error seems to be OOM. Can you use less tfrecord files and set lower batch-size?

More, can you run nvidia-smi successfully?

zyann · November 25, 2022, 3:03am

I run nvidia-smi successfully.
I’ve tried setting a lower batch doesn’t work.
At the same time, there is a problem, RTX3090 and RTX4090 have 24G video memory. RTX3090 does not have OOM but RTX4090 OOM is unreasonable. So I think it’s the RTX4090 that isn’t working.

Morganh · November 25, 2022, 3:18am

For 3090, does it run in WSL or linux machine?
How about 4090 as well?
BTW, if you have time, please try other network. For example, LPRnet.
Thanks.

zyann · November 25, 2022, 3:36am

3090 run in linux machine and 4090 run in WSL2. I’m not sure if this will make a difference.
I’ll try other network later with 4090.

zyann · November 25, 2022, 9:17am

LPRnet can be trained normally with 4090.

Morganh · November 25, 2022, 10:37am

Thanks for the info. So, seems that the 40 series products can work with TAO.
For OOM issue in Maskrcnn, please try to use:

less tfrecords. For example, only use one tfrecord file.

I also find some topics which are talking about Maskrcnn OOM under WSL .
For example, Issue while converting maskrcnn model to trt from etlt on Laptops - #23 by alaapdhall79 , its solution is temporarily increasing the “SWAP” Memory .

zyann · November 26, 2022, 9:36am

Thank you very much, I think you are right.
I tried everything under wsl and it didn’t work. So I installed linux and everything worked fine after that.
WSL may not be a good option.

system · December 10, 2022, 9:37am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.