TAO Toolkit exits with "Kill" without reason

ColinPs26kt · February 12, 2022, 4:36am

Please provide the following information when requesting support.

• Hardware : RTX2070
• Network Type: Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): tao-toolkit-tf:v3.21.11-tf1.15.4-py3
• Training spec file: attached below
• How to reproduce the issue ? Please see below

Hi,

I’m trying to train my own resnet18 detector using a custom dataset. I’ve done all the necessary steps (conversion to KITTI format, etc.). I’m able to run “train” using the command below:

detectnet_v2 train -e /workspace/tao-experiments/specs/combined_training_config.txt -r /workspace/tao-experiments/data/resnet18_detector -k tlt_encode -n resnet18_detector

However, after a point in time, I get this message:

The process is “Killed” and I don’t know why.

The GPU was being utilized successfully:

The “resnet18_detector” folder for the model output, along with the “weights” subfolder, was created successfully. However, no weights were generated.

I’m pretty certain I set all the mounts correctly, as all earlier errors related to them have been resolved and the TAO docker can find every directory. I’m attaching the generated status.json file along with the config file I used.
status.json (1.3 KB)
combined_training_config.txt (7.6 KB)

Please let me know how I may debug this issue.

Thank you!

Morganh · February 12, 2022, 1:39pm

It may be related to OOM (out of memory). Please try a lower batch size.
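
One way to confirm the OOM suspicion is to look for the Linux OOM killer's report in the host's kernel log. The snippet below is a minimal sketch, not part of TAO: `find_oom_kills` and `read_kernel_log` are hypothetical helper names, and the regular expression assumes the usual "Killed process <pid> (<name>)" wording that Linux kernels print when the OOM killer fires.

```python
import re
import subprocess

# Assumed kernel wording, e.g.:
# "Out of memory: Killed process 591 (python) total-vm:..."
OOM_LINE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str):
    """Return (pid, process_name) pairs for every OOM kill found in the log text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_LINE.finditer(kernel_log)]


def read_kernel_log() -> str:
    """Return the output of `dmesg`; run this on the host, it may need root privileges."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout
```

Calling `find_oom_kills(read_kernel_log())` on the host right after the training stops should list the training process if host RAM ran out. An empty list suggests the "Killed" message has another cause.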

ColinPs26kt · February 15, 2022, 10:52pm

Hi @Morganh,

You were right. I lowered the batch size and I could continue. Thanks!

I switched over to try yolov4 and now get a different error.

I used the command:
yolo_v4 train -e /workspace/tao-experiments/specs/combined_training_config_yolov4.txt -r /workspace/tao-experiments/data/yolov4_detector -k tlt_encode

This is my new config.
combined_training_config_yolov4.txt (2.4 KB)

The logs are somehow empty.

It’s similar to the issue in an earlier ticket (Errore CUDA failure 'an illegal memory access was encountered') but I didn’t see a solution there.

Morganh · February 16, 2022, 2:07am

For this yolov4 training, could you please try to run with sequence format instead of tfrecord format?

Morganh · February 16, 2022, 6:03am

Unfortunately, in version 3.21.11 there is an issue with “yolo_v4 evaluate”.
Please change to sequence format as shown below.

validation_data_sources: {
  label_directory_path: "xxx"
  image_directory_path: "xxx"
}

Morganh · February 16, 2022, 8:50am

For training or evaluation on tfrecord files, please set
force_on_cpu : True

Setting it to True will force NMS to run on CPU during training. This is useful when using a TFRecord dataset for validation during training, since there is a known issue with TensorFlow NMS on GPU in that case. Note that this flag does not have any impact on TAO export and TensorRT/DeepStream inference.

See more in
https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/yolo_v4.html#nms-config

ColinPs26kt · February 17, 2022, 2:01am

Hi @Morganh,

Changing to sequence format works for me. Thanks!

I’d like to clarify about the “randomize_input_shape_period” parameter. I need to set it to 0 before I can get anything to work.

When I set it to 100, I get this error.
combined_training_config_yolov4 (run 4 - doesn’t work).txt (2.5 KB)

Epoch 1/10
2/2030 […] - ETA: 133:17:02 - loss: 8637.0806
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.847671). Check your callbacks.
% delta_t_median)
100/2030 [>…] - ETA: 3:31:39 - loss: 8542.0017
[8adbf8fc3105:00591] *** Process received signal ***
[8adbf8fc3105:00591] Signal: Segmentation fault (11)
[8adbf8fc3105:00591] Signal code: Address not mapped (1)
[8adbf8fc3105:00591] Failing at address: 0x10
[8adbf8fc3105:00591] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fef77a83040]
[8adbf8fc3105:00591] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor3addIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7fef0af30f90]
[8adbf8fc3105:00591] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x522)[0x7fef04d1c382]
[8adbf8fc3105:00591] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf978ab)[0x7fef04d7d8ab]
[8adbf8fc3105:00591] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf97c6f)[0x7fef04d7dc6f]
[8adbf8fc3105:00591] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7fef04e2d791]
[8adbf8fc3105:00591] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fef04e2adf8]
[8adbf8fc3105:00591] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7fef6b3546df]
[8adbf8fc3105:00591] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fef7782c6db]
[8adbf8fc3105:00591] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fef77b6571f]
[8adbf8fc3105:00591] *** End of error message ***
Segmentation fault (core dumped)

When I set it to 0, everything runs fine.
combined_training_config_yolov4 (run 5 - works)).txt (2.5 KB)

Morganh · February 17, 2022, 2:58am

May I know the resolution of your training images?
BTW, can this error be reproduced with the public KITTI dataset?

ColinPs26kt · February 17, 2022, 10:45am

@Morganh,

I’m using a dataset that has a wide range of widths and heights. Image widths range from 366 to 3872 and heights range from 297 to 2736 pixels.

I have not tried the public KITTI dataset.
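
For anyone who needs to answer the same resolution question, the width and height ranges of a dataset can be gathered with a few lines of Python. This is only a sketch under assumptions: `size_ranges` and `read_sizes` are hypothetical helper names, the image folder is assumed to hold .jpg/.jpeg/.png files, and `read_sizes` relies on the third-party Pillow package being installed.

```python
from pathlib import Path
from typing import Iterable, Tuple


def size_ranges(sizes: Iterable[Tuple[int, int]]):
    """Return ((min_w, max_w), (min_h, max_h)) for (width, height) pairs.

    Raises ValueError if `sizes` is empty.
    """
    widths, heights = zip(*sizes)
    return (min(widths), max(widths)), (min(heights), max(heights))


def read_sizes(image_dir: str):
    """Yield (width, height) for each image file found directly in image_dir."""
    from PIL import Image  # third-party Pillow, imported lazily (assumed installed)

    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            with Image.open(path) as img:
                yield img.size  # Pillow reports size as (width, height)
```

`size_ranges(read_sizes(image_dir))`, with `image_dir` pointing at the training images, gives the two ranges in one pass.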

Morganh · February 17, 2022, 10:52am

Thanks for the info. I am curious about the error you mentioned earlier with “randomize_input_shape_period: 100”.
If possible, could you use part of the training dataset to check whether it can be reproduced?

Morganh · February 17, 2022, 10:56am

BTW, are you using tao-toolkit-tf:v3.21.11-tf1.15.5-py3 to train yolo_v4?
From the log, you are running inside the docker, right? How did you launch the docker?

ColinPs26kt · February 17, 2022, 11:23am

I’m using tao-toolkit-tf:v3.21.11-tf1.15.4-py3.

I initially entered the Python virtual environment using these 3 commands:

export VIRTUALENVWRAPPER_PYTHON='/usr/bin/python3'
source /usr/local/bin/virtualenvwrapper.sh
mkvirtualenv launcher

Then I decided to launch the docker directly inside the virtual environment:

tao detectnet_v2 run /bin/bash

Morganh · February 17, 2022, 11:54am

Thanks for the info.
For the error, please try again after launching the docker with the command below. It will use the 3.21.11-tf1.15.5-py3 docker.
tao yolo_v4 run /bin/bash

ColinPs26kt · February 28, 2022, 11:10am

Hi @Morganh ,

Yes, you’re right. I switched to the 3.21.11-tf1.15.5-py3 docker and it’s fine now. Everything is working well. Thanks.

Colin

system · March 14, 2022, 11:10am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
