Invalid PNG data, size 789337

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) RTX 3080ti
• Network Type (Yolo_v4)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Hi, I am using the same dataset, same docker image, same config file, and same Ubuntu version on a different machine. Training succeeded on an RTX 2080ti. However, on the RTX 3080ti I get the error shown below. Please help!

File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py", line 692, in train
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
outs = f(ins)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5710}} Invalid PNG data, size 789337
[[{{node AssetLoader/DecodePng}}]]
[[data_loader_out]]

Please share the full command and full log.

Thanks for your reply.
Command:

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 1

Logs and spec file
log.txt (68.2 KB)
yolo_v4_train_resnet18_kitti.txt (2.4 KB)

Is bad memory causing this problem?

According to “tao info --verbose”, for yolov4, please use nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 instead.

Also, you can train on only part of the training tfrecords to narrow down the problem.

Yes. I have narrowed our dataset down to a small number of samples that were checked very carefully, and I still have this problem.
I will check with nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 and let you know later.

You should be using the tf1.15.5 image.

Sorry, my bad. I have tried with v3.22.05-tf1.15.5-py3 and still get the same error.

To narrow down, I suggest using the sequence format instead.
See more in YOLOv4 — TAO Toolkit 3.22.05 documentation

This should help find out which image is the culprit.

Thank you.

Hi @Morganh, I used the sequence format and got the same problem. All the images and labels listed in the log look normal.
log.txt (4.7 KB)
Is this problem caused by RAM? I checked my RAM, and it does have a problem, as I mentioned above.

Suggest trying another machine or rebooting this machine.
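Since the same dataset trains on one machine and fails on another, one way to rule out a damaged copy is to compare file hashes between the two machines. The following is a minimal sketch using only the Python standard library; the function names (`sha256_of`, `build_manifest`, `diff_manifests`) are illustrative and not part of TAO. Building the manifest twice on the suspect machine and getting different digests for the same file would suggest a hardware issue (for example faulty RAM or storage) rather than a bad image.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(reference, candidate):
    """Return files from the reference that are missing or differ in the candidate."""
    return sorted(
        name for name, digest in reference.items()
        if candidate.get(name) != digest
    )
```

A possible workflow, assuming the dataset lives in the same relative layout on both machines: run `build_manifest` on the working machine, save the result (for example as JSON), copy it to the failing machine, and pass both manifests to `diff_manifests`.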

Thank you very much.

From the log, the culprit is /workspace/tao-experiments/data/training/image_2/on the floor-6-237.png. Please check it.
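One way to check the file named above, along with the rest of the image directory, is a quick structural scan of each PNG. The sketch below uses only the Python standard library; `check_png` and `find_bad_pngs` are illustrative names, not TAO functions. It verifies the PNG signature, each chunk's CRC, and the presence of the final IEND chunk. It does not decode pixel data, so a file that passes here may still be rejected by TensorFlow's DecodePng op for other reasons.

```python
import struct
import zlib
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png(data: bytes) -> str:
    """Return "ok" or a short description of the first structural problem."""
    if not data.startswith(PNG_SIGNATURE):
        return "bad signature"
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        if pos + 8 > len(data):
            return "truncated chunk header"
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        name = ctype.decode("latin-1")
        end = pos + 8 + length + 4
        if end > len(data):
            return f"truncated {name} chunk"
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        # The CRC covers the chunk type and body, not the length field.
        if zlib.crc32(ctype + body) & 0xFFFFFFFF != stored_crc:
            return f"CRC mismatch in {name} chunk"
        if ctype == b"IEND":
            return "ok"
        pos = end
    return "missing IEND chunk"


def find_bad_pngs(image_dir):
    """Map file name to problem description for every failing *.png in image_dir."""
    bad = {}
    for path in sorted(Path(image_dir).glob("*.png")):
        status = check_png(path.read_bytes())
        if status != "ok":
            bad[path.name] = status
    return bad


if __name__ == "__main__":
    # Directory taken from the log message above.
    image_dir = "/workspace/tao-experiments/data/training/image_2"
    for file_name, problem in find_bad_pngs(image_dir).items():
        print(f"{file_name}: {problem}")
```

If the same file passes this check on the RTX 2080ti machine but fails on the RTX 3080ti machine, the copy on the second machine (or the hardware reading it) is the more likely cause than the original image.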