TAO - PIL.Image.DecompressionBombError

Hi,

I’m running the notebook provided in TAO Toolkit Getting Started | NVIDIA NGC to train a model with a custom dataset.
I’m using an AWS VM and followed the instructions from Running TAO Toolkit on an AWS VM - NVIDIA Docs

It has worked a couple of times already, but now I need to use bigger images (13190 x 15880 pixels), and when I run the command to convert the training data to TFRecords:

!tao model mask_rcnn dataset_convert -i $DATA_DOWNLOAD_DIR/raw-data/train2017 \
    -a $DATA_DOWNLOAD_DIR/raw-data/annotations/instances_train2017.json \
    -o $DATA_DOWNLOAD_DIR/maskrcnn --include_masks -t train -s 256 \
    -r $USER_EXPERIMENT_DIR/

It returns the following:

2023-12-07 14:25:33,862 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-12-07 14:25:33,939 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-12-07 14:25:33,960 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 262:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2023-12-07 14:25:33,960 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-12-07 14:25:35.158350: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-12-07 14:25:35,218 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2023-12-07 14:25:37.489631: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2023-12-07 14:25:37,705 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-12-07 14:25:37,756 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-12-07 14:25:37,761 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-12-07 14:25:38,425 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2023-12-07 14:25:39.515731: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2023-12-07 14:25:39.538430: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/train
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/train
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:0 images are missing bboxes.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/dataset_convert.py", line 199, in _pool_create_tf_example
    return create_tf_example(*args)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/dataset_convert.py", line 70, in create_tf_example
    image = PIL.Image.open(encoded_jpg_io)
  File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3016, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3003, in _open_core
    _decompression_bomb_check(im.size)
  File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 2912, in _decompression_bomb_check
    raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (200580280 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/dataset_convert.py", line 415, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/dataset_convert.py", line 403, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/dataset_convert.py", line 313, in main
    log_total = _create_tf_record_from_coco_annotations(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/dataset_convert.py", line 263, in _create_tf_record_from_coco_annotations
    for idx, (_, tf_example, num_annotations_skipped, log_warnings) in enumerate(
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
PIL.Image.DecompressionBombError: Image size (200580280 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
Execution status: FAIL
2023-12-07 14:25:44,176 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

So far, I've tried this solution: Pillow in Python won't let me open image ("exceeds limit") - Stack Overflow

I tried to edit the file:
/home/ubuntu/.virtualenvs/launcher/lib/python3.8/site-packages/PIL/Image.py
changing line: MAX_IMAGE_PIXELS = int(1024 * 1024 * 1024 / 4 / 3)
to: MAX_IMAGE_PIXELS = None

I also tried to access the file 'Image.py' in:

%cd /usr/local/lib/python3.8/dist-packages/PIL

but it returned:

[Errno 2] No such file or directory: '/usr/local/lib/python3.8/dist-packages/PIL'
/usr/local/lib/python3.8/dist-packages

The folder ‘/usr/local/lib/python3.8/dist-packages’ appears to be empty.

Thanks in advance!

Please log in to the docker container as below.

$ docker run --runtime=nvidia -it --rm -v /home/morganh:/home/morganh nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

Then modify the /usr/local/lib/python3.8/dist-packages/PIL/Image.py.

Then, still inside the docker container, run the commands without "tao model" at the beginning:
$ mask_rcnn dataset_convert
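
For example, the whole flow could look like below (just a sketch — adjust the mounted folder and the dataset paths to your own setup; the sed one-liner is only a shortcut for editing Image.py by hand):

$ docker run --runtime=nvidia -it --rm -v /home/ubuntu:/home/ubuntu nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

# inside the container: lift PIL's decompression-bomb limit
sed -i 's/^MAX_IMAGE_PIXELS = .*/MAX_IMAGE_PIXELS = None/' /usr/local/lib/python3.8/dist-packages/PIL/Image.py

# optional: confirm which Image.py is used and that the limit is gone
python3 -c "import PIL.Image as I; print(I.__file__, I.MAX_IMAGE_PIXELS)"

# then run the converter directly, with the same arguments as in the notebook,
# but with the paths as they are seen inside the container
mask_rcnn dataset_convert -i <train images dir> -a <instances_train2017.json> -o <output dir> --include_masks -t train -s 256 -r <results dir>

Note that the edit only lives inside that container instance; once you exit and start a new container, it has to be applied again.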

I could log in to the docker container and modify Image.py.

But I can't connect to Jupyter.

The command seems to work:

root@a706ae5d814e:/workspace# jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root --NotebookApp.token=123

returns

2023-12-08 19:03:05.311770: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
[I 19:03:06.444 NotebookApp] jupyter_tensorboard extension loaded.
[I 19:03:06.478 NotebookApp] JupyterLab extension loaded from /usr/local/lib/python3.8/dist-packages/jupyterlab
[I 19:03:06.478 NotebookApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 19:03:06.480 NotebookApp] [Jupytext Server Extension] NotebookApp.contents_manager_class is (a subclass of) jupytext.TextFileContentsManager already - OK
[I 19:03:06.481 NotebookApp] Serving notebooks from local directory: /workspace
[I 19:03:06.482 NotebookApp] Jupyter Notebook 6.4.10 is running at:
[I 19:03:06.482 NotebookApp] http://hostname:8888/?token=
[I 19:03:06.482 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

but when I try to connect with a browser, I receive:

This site can’t be reached

I managed to open Jupyter using:

docker run --runtime=nvidia -it --rm -v /home/morganh:/home/morganh -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

and

root@a0767de70673:/workspace# jupyter notebook --ip 0.0.0.0 --allow-root

like here: No web browser found: could not locate runnable browser - #6 by Morganh

I will try to run the notebook now.

Thank you Morganh

It worked.
Unfortunately, I lost the files when I stopped the AWS EC2 instance and started it again.
Do you have a suggestion so I can continue working where I left off when I log in to the docker container again?

Glad to know it is working now.
Your latest question is really an AWS question; I suggest searching for help online, for example Stop and start your instance - Amazon Elastic Compute Cloud, or requesting help from AWS. Thanks.

Hi @Morganh,

I'm stuck at training now.
I uploaded the TFRecords to the notebook I was using before and logged in to Docker normally ("docker login nvcr.io"), because when I log in using your instructions I can't keep the files between sessions.

When I run training with the command:

!tao model mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned\
                     --gpus $NUM_GPUS

It returns an error without much information, just before it would start the first step:

For multi-GPU, change --gpus based on your machine.
2023-12-14 19:42:47,812 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-12-14 19:42:47,906 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-12-14 19:42:47,919 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2023-12-14 19:42:47,919 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2023-12-14 19:42:49.232719: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-12-14 19:42:49,294 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2023-12-14 19:42:52.461370: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2023-12-14 19:42:52,747 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-12-14 19:42:52,819 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-12-14 19:42:52,829 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-12-14 19:42:53,799 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2023-12-14 19:42:55.164952: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2023-12-14 19:42:55.189553: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-12-14 19:42:58,429 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-12-14 19:42:58,482 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-12-14 19:42:58,487 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
[INFO] Starting MaskRCNN training.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp1mobhhvd', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 4
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f12c93cbc40>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Loading pretrained model...
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:254: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:257: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:258: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

[MaskRCNN] INFO    : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 01
[MaskRCNN] INFO    : =================================
    
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 1202.4 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Done calling model_fn.
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for epoch 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt.
Execution status: FAIL
2023-12-14 19:45:27,975 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Could it be related to the images being bigger than PIL's limit?

UPDATE:
I ran training using the docker container you suggested and encountered the exact same error.

So it's not the PIL image size limit.

Can you share the spec file?
Also, can you use a new result folder and retry?

-d $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new

I am thinking there is OOM (out-of-memory) during training. To narrow down, please set a lower input size in the spec file and retry.
For example,

image_size: "(3296, 3968)"

or

image_size: "(1648, 1984)"

or

image_size: "(842, 992)"

Also check GPU memory, etc.
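
For example, while the training starts you can keep an eye on both GPU memory and system memory (just a sketch):

watch -n 2 nvidia-smi   # GPU utilization and memory, refreshed every 2 seconds
free -h                 # system RAM and swap usage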

In addition, you can temporarily increase the swap memory in the Linux system. Refer to Issue while converting maskrcnn model to trt from etlt on Laptops - #23 by alaapdhall79
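
A typical way to add swap temporarily on Ubuntu looks roughly like below (a sketch — the 16G size is only an example, pick whatever fits your disk):

sudo fallocate -l 16G /swapfile   # reserve a 16 GB swap file
sudo chmod 600 /swapfile          # restrict permissions
sudo mkswap /swapfile             # format it as swap
sudo swapon /swapfile             # enable it immediately
free -h                           # confirm the extra swap is active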

It was OOM; I increased the swap memory, but now I've hit another problem.

I reached 100% on all CPUs and everything froze. After a while, PuTTY showed: "PuTTY Fatal Error: Network error: Software caused connection abort." And I can't log back in to the VM.

I'm waiting for a response from the AWS sales team to approve more vCPUs; is that a reasonable path to solve this problem?

Is it caused by the images being big, even though I set a low image_size in the spec file?

Attached you can find screenshots from 'nvtop' and 'htop' at the moment the system froze.
The spec file was:

seed: 123
use_amp: False
warmup_steps: 0
checkpoint: "/workspace/tao-experiments/mask_rcnn/pretrained_resnet50/pretrained_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000]"
learning_rate_decay_levels: "[0.5]"
num_epochs: 150
train_batch_size: 1
eval_batch_size: 1
num_steps_per_eval: 110
momentum: 0.9
l2_weight_decay: 0.0001
l1_weight_decay: 0.0
warmup_learning_rate: 0.0001
init_learning_rate: 0.001
num_examples_per_epoch: 81
visualize_images_summary: True

data_config{
    image_size: "(842, 992)"
    augment_input_data: False
    eval_samples: 22
    training_file_pattern: "/workspace/tao-experiments/data/maskrcnn/train*.tfrecord"
    validation_file_pattern: "/workspace/tao-experiments/data/maskrcnn/val*.tfrecord"
    val_json_file: "/workspace/tao-experiments/data/annotations/tfrecords/val.json"

    # dataset specific parameters
    num_classes: 17
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}

To narrow down, can you use a smaller part of the TFRecords and retry? For example, use only one tfrecord file.

I changed only these lines in the spec file.

    training_file_pattern: "/workspace/tao-experiments/data/maskrcnn/tfrecords_output/train-00000-of-00040.tfrecord"
    validation_file_pattern: "/workspace/tao-experiments/data/maskrcnn/tfrecords_output/val-00000-of-00011.tfrecord"

Same result: 100% CPU and it freezes.

If possible, could you please try a local machine to check if it works?
Also, did you ever try to run with 1 GPU as well?
In addition, is it possible to share val-00000-of-00011.tfrecord and val.json with me?
I can check whether I can reproduce it.

Yes, I can try it, but it would take a few days to set up a machine here.

I'm not sure what you meant. The machine has 1 Tesla V100 with 16 GB.

Yes, here is a link to download: WeTransfer - Send Large Files & Share Photos Online - Up to 2GB Free

Just to update: it was memory.

I got an upgrade from AWS to a 32-vCPU machine with 4x Tesla V100 16 GB, and now everything is running fine (using 95% of the GPUs, 70% of the CPUs, and 68 GB of memory).

Thank you @Morganh !
