TAO Toolkit exits with "Kill" without reason

Please provide the following information when requesting support.

• Hardware : RTX2070
• Network Type: Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): tao-toolkit-tf:v3.21.11-tf1.15.4-py3
• Training spec file: attached below
• How to reproduce the issue ? Please see below

Hi,

I’m trying to train my own resnet18 detector using a custom dataset. I’ve done all the necessary steps (conversion to KITTI format, etc.). I’m able to start training using the command below:

detectnet_v2 train -e /workspace/tao-experiments/specs/combined_training_config.txt -r /workspace/tao-experiments/data/resnet18_detector -k tlt_encode -n resnet18_detector

However, after a point in time, the training process is simply “Killed” and I don’t know why.

The GPU was being utilized successfully while training was running.

The “resnet18_detector” output folder, along with its “weights” subfolder, was created successfully. However, no weights were generated.

I’m fairly certain all the mounts are set correctly, as all earlier errors related to them have been resolved and all directories can be found inside the TAO docker. I’m attaching the generated status.json file along with the config file I used.
status.json (1.3 KB)
combined_training_config.txt (7.6 KB)

Please let me know how I may debug this issue.

Thank you!

It may be related to OOM (out of memory). Please try a lower batch size.
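For reference, the batch size is set in the training_config section of the DetectNet_v2 spec; the values below are only an illustration, not the ones from your attached spec:

training_config {
  # illustrative values only; lower batch_size_per_gpu (e.g. from 16 to 4) if the process is killed by OOM
  batch_size_per_gpu: 4
  num_epochs: 120
}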

Hi @Morganh,

You were right. I lowered the batch size and I could continue. Thanks!

I switched over to try yolov4 and now get a different error.

I used the command:
yolo_v4 train -e /workspace/tao-experiments/specs/combined_training_config_yolov4.txt -r /workspace/tao-experiments/data/yolov4_detector -k tlt_encode

This is my new config.
combined_training_config_yolov4.txt (2.4 KB)

The logs are somehow empty.

It’s similar to the issue in an earlier ticket (Errore CUDA failure 'an illegal memory access was encountered') but I didn’t see a solution there.

For this yolov4 training, could you please try to run with sequence format instead of tfrecord format?

Unfortunately, in version 3.21.11, there is a known issue with “yolo_v4 evaluate”.
Please change to sequence format as below.

validation_data_sources: {
  label_directory_path: "xxx"
  image_directory_path: "xxx"
}
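For context, validation_data_sources sits inside dataset_config; a minimal sketch with hypothetical paths (replace them with your own mounted directories):

dataset_config {
  # ...existing data_sources and target_class_mapping entries unchanged...
  validation_data_sources: {
    # hypothetical paths for illustration
    label_directory_path: "/workspace/tao-experiments/data/val/label"
    image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}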

For training or evaluation on TFRecord files, please set force_on_cpu: true in the nms_config.

Setting it to true forces NMS to run on the CPU during training. This is useful when using a TFRecord dataset for validation during training, since there is a known issue with TensorFlow NMS on GPU in that case. Note that this flag does not have any impact on TAO export or TensorRT/DeepStream inference.
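For example (a sketch based on the linked docs; the threshold values here are illustrative, not taken from your spec), the flag goes under nms_config:

nms_config {
  # illustrative values; only force_on_cpu is the point here
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
  force_on_cpu: true
}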

See more in
https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/yolo_v4.html#nms-config

Hi @Morganh,

Changing to sequence format works for me. Thanks!

I’d like to clarify the “randomize_input_shape_period” parameter: I need to set it to 0 before I can get anything to work.
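For reference, this is where the parameter sits in my spec’s augmentation_config (the output dimensions below are illustrative, not the exact values from the attached file):

augmentation_config {
  # illustrative output dimensions
  output_width: 1248
  output_height: 384
  output_channel: 3
  # 0 keeps a fixed input resolution; a value like 100 randomizes the input shape every 100 batches
  randomize_input_shape_period: 0
}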

When I set it to 100, I get this error.
combined_training_config_yolov4 (run 4 - doesn’t work).txt (2.5 KB)

Epoch 1/10
2/2030 […] - ETA: 133:17:02 - loss: 8637.0806/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.847671). Check your callbacks.
% delta_t_median)
100/2030 [>…] - ETA: 3:31:39 - loss: 8542.0017[8adbf8fc3105:00591] *** Process received signal ***
[8adbf8fc3105:00591] Signal: Segmentation fault (11)
[8adbf8fc3105:00591] Signal code: Address not mapped (1)
[8adbf8fc3105:00591] Failing at address: 0x10
[8adbf8fc3105:00591] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fef77a83040]
[8adbf8fc3105:00591] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor3addIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7fef0af30f90]
[8adbf8fc3105:00591] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x522)[0x7fef04d1c382]
[8adbf8fc3105:00591] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf978ab)[0x7fef04d7d8ab]
[8adbf8fc3105:00591] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf97c6f)[0x7fef04d7dc6f]
[8adbf8fc3105:00591] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7fef04e2d791]
[8adbf8fc3105:00591] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fef04e2adf8]
[8adbf8fc3105:00591] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7fef6b3546df]
[8adbf8fc3105:00591] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fef7782c6db]
[8adbf8fc3105:00591] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fef77b6571f]
[8adbf8fc3105:00591] *** End of error message ***
Segmentation fault (core dumped)

When I set it to 0, everything runs fine.
combined_training_config_yolov4 (run 5 - works)).txt (2.5 KB)

May I know the resolution of your training images?
BTW, can this error be reproduced with the public KITTI dataset?

@Morganh,

I’m using a dataset with a wide range of widths and heights: image widths range from 366 to 3872 pixels and heights from 297 to 2736 pixels.

I have not tried the public KITTI dataset.

Thanks for the info. I am curious about the error you mentioned earlier from “randomize_input_shape_period: 100”.
If possible, could you use part of the training dataset to check whether it can be reproduced?

BTW, are you using tao-toolkit-tf:v3.21.11-tf1.15.5-py3 to train yolo_v4?
From the log, it looks like you are running inside the docker, right? How did you trigger the docker?

I’m using tao-toolkit-tf:v3.21.11-tf1.15.4-py3.

I initially entered the Python virtual environment using these 3 commands:

export VIRTUALENVWRAPPER_PYTHON='/usr/bin/python3'
source /usr/local/bin/virtualenvwrapper.sh
mkvirtualenv launcher

Then I decided to launch the docker directly inside the virtual environment:

tao detectnet_v2 run /bin/bash

Thanks for the info.
For the error, please try again with the command below. It will use the 3.21.11-tf1.15.5-py3 docker.
tao yolo_v4 run /bin/bash
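If you want to double-check which docker each task maps to, you can inspect the launcher’s info output; the docker_tag listed for each task group shows which container will be used:

tao info --verbose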

Hi @Morganh ,

Yes, you’re right. I switched to the 3.21.11-tf1.15.5-py3 docker and it’s fine now. Everything is working well. Thanks.

Colin
