Core dumped while re-training pruned Detectnet model

I’m trying to retrain a pruned model with the changes described in the #re-training-the-pruned-model section of the documentation.

The model starts training, then at a random epoch it dies with a core dumped error. Sometimes it’s one error, sometimes another.

Training log including many different experiments:
training_pruned_3_log.txt (1.0 MB)

errors:

INFO:tensorflow:global_step/sec: 1.42614
2022-04-20 07:21:34,514 [INFO] tensorflow: global_step/sec: 1.42614
INFO:tensorflow:epoch = 8.633986928104575, learning_rate = 0.00049999997, loss = 0.00067046296, step = 1321 (6.093 sec)
2022-04-20 07:21:35,095 [INFO] tensorflow: epoch = 8.633986928104575, learning_rate = 0.00049999997, loss = 0.00067046296, step = 1321 (6.093 sec)
2022-04-20 07:21:35.256558: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[c1284621e0a1:03455] *** Process received signal ***
[c1284621e0a1:03455] Signal: Aborted (6)
[c1284621e0a1:03455] Signal code:  (-6)
[c1284621e0a1:03455] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f6c8be52040]
[c1284621e0a1:03455] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f6c8be51fb7]
[c1284621e0a1:03455] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f6c8be53921]
[c1284621e0a1:03455] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x85fa784)[0x7f6c28e1e784]
[c1284621e0a1:03455] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f6c287d3507]
[c1284621e0a1:03455] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f6c287d3d9f]
[c1284621e0a1:03455] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f6c1fad0fa1]
[c1284621e0a1:03455] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f6c1face608]
[c1284621e0a1:03455] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f6c89d3c6df]
[c1284621e0a1:03455] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f6c8bbfb6db]
[c1284621e0a1:03455] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f6c8bf3471f]
[c1284621e0a1:03455] *** End of error message ***
Aborted (core dumped)
2022-04-20 08:25:50,176 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 23.591
INFO:tensorflow:epoch = 24.856209150326798, learning_rate = 0.00049999997, loss = 0.00029254958, step = 3803 (6.269 sec)
2022-04-20 08:25:52,829 [INFO] tensorflow: epoch = 24.856209150326798, learning_rate = 0.00049999997, loss = 0.00029254958, step = 3803 (6.269 sec)
2022-04-20 08:25:53.713622: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[c1284621e0a1:10710] *** Process received signal ***
[c1284621e0a1:10710] Signal: Aborted (6)
[c1284621e0a1:10710] Signal code:  (-6)
[c1284621e0a1:10710] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f7af5f41040]
[c1284621e0a1:10710] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f7af5f40fb7]
[c1284621e0a1:10710] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f7af5f42921]
[c1284621e0a1:10710] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x85fa784)[0x7f7a92f0d784]
[c1284621e0a1:10710] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f7a928c2507]
[c1284621e0a1:10710] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f7a928c2d9f]
[c1284621e0a1:10710] [ 6] 2022-04-20 08:25:53.721017: F ./tensorflow/core/util/gpu_launch_config.h:169] Check failed: err == cudaSuccess (700 vs. 0)
Aborted (core dumped)
2022-04-20 10:07:23,401 [INFO] tensorflow: global_step/sec: 2.9402
INFO:tensorflow:epoch = 36.937704918032786, learning_rate = 0.00049999997, loss = 0.000411291, step = 11266 (5.498 sec)
2022-04-20 10:07:24,881 [INFO] tensorflow: epoch = 36.937704918032786, learning_rate = 0.00049999997, loss = 0.000411291, step = 11266 (5.498 sec)
2022-04-20 10:07:25.939245: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[c1284621e0a1:28397] *** Process received signal ***
[c1284621e0a1:28397] Signal: Aborted (6)
[c1284621e0a1:28397] Signal code:  (-6)
[c1284621e0a1:28397] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f51774a1040]
[c1284621e0a1:28397] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f51774a0fb7]
[c1284621e0a1:28397] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f51774a2921]
[c1284621e0a1:28397] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x85fa784)[0x7f511446d784]
[c1284621e0a1:28397] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f5113e22507]
[c1284621e0a1:28397] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f5113e22d9f]
[c1284621e0a1:28397] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f510b11ffa1]
[c1284621e0a1:28397] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f510b11d608]
[c1284621e0a1:28397] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f517538b6df]
[c1284621e0a1:28397] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f517724a6db]
[c1284621e0a1:28397] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f517758371f]
[c1284621e0a1:28397] *** End of error message ***
Aborted (core dumped)

INFO:tensorflow:epoch = 5.79344262295082, learning_rate = 0.00014038816, loss = 0.0003366814, step = 1767 (5.538 sec)
2022-04-20 10:43:01,452 [INFO] tensorflow: epoch = 5.79344262295082, learning_rate = 0.00014038816, loss = 0.0003366814, step = 1767 (5.538 sec)
INFO:tensorflow:global_step/sec: 3.07895
2022-04-20 10:43:02,465 [INFO] tensorflow: global_step/sec: 3.07895
2022-04-20 10:43:02.721572: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[c1284621e0a1:07803] *** Process received signal ***
[c1284621e0a1:07803] Signal: Aborted (6)
[c1284621e0a1:07803] Signal code:  (-6)
[c1284621e0a1:07803] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f953f581040]
[c1284621e0a1:07803] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f953f580fb7]
[c1284621e0a1:07803] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f953f582921]
[c1284621e0a1:07803] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x85fa784)[0x7f94dc54d784]
[c1284621e0a1:07803] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f94dbf02507]
[c1284621e0a1:07803] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f94dbf02d9f]
[c1284621e0a1:07803] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f94d31fffa1]
[c1284621e0a1:07803] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f94d31fd608]
[c1284621e0a1:07803] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f953d46b6df]
[c1284621e0a1:07803] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f953f32a6db]
[c1284621e0a1:07803] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f953f66371f]
[c1284621e0a1:07803] *** End of error message ***
Aborted (core dumped)
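
Since the attached log covers many runs, it may help to reduce it to the distinct fatal signatures. The following is a minimal sketch, not part of TAO or TensorFlow: the function name `fatal_signatures` and the `SAMPLE` text are hypothetical, and the regular expression assumes the fatal-line layout seen in the excerpts above (timestamp, `F`, source file, line number, message).

```python
import re
from collections import Counter

# Matches TensorFlow fatal ("F") log lines as they appear in the excerpts above.
# search() is used instead of match() because one fatal line in the log is
# interleaved into the middle of a backtrace line.
FATAL_RE = re.compile(
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: F "
    r"(?P<src>\S+):(?P<line>\d+)\] (?P<msg>.*)"
)


def fatal_signatures(log_text):
    """Count distinct (source file, line number, message) fatal entries."""
    counts = Counter()
    for line in log_text.splitlines():
        m = FATAL_RE.search(line)
        if m:
            key = (m.group("src"), int(m.group("line")), m.group("msg").strip())
            counts[key] += 1
    return counts


# A few lines copied from the log excerpts in this post (hypothetical variable name).
SAMPLE = """\
2022-04-20 07:21:34,514 [INFO] tensorflow: global_step/sec: 1.42614
2022-04-20 07:21:35.256558: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2022-04-20 08:25:53.713622: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[c1284621e0a1:10710] [ 6] 2022-04-20 08:25:53.721017: F ./tensorflow/core/util/gpu_launch_config.h:169] Check failed: err == cudaSuccess (700 vs. 0)
2022-04-20 10:07:25.939245: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Aborted (core dumped)
"""

if __name__ == "__main__":
    for (src, line_no, msg), n in fatal_signatures(SAMPLE).most_common():
        print(f"{n}x {src}:{line_no} {msg}")
```

On these excerpts the crashes reduce to two signatures: the `gpu_event_mgr.cc:273` event-status abort and a single `cudaSuccess` check failure with CUDA error code 700.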

First experiment spec:
initial_experiment_spec.txt (3.1 KB)
This training worked perfectly.

Pruning command:

detectnet_v2 prune -m TAO/results/people/detectnet/model.step-12240.tlt -o TAO/results/people/detectnet/model.step-12240-pruned.tlt -eq union -pth 0.1 -k tlt_ecode

Retraining pruned model experiment spec:
retraining_pruned_model_experiment_spec.txt (3.1 KB)

• Hardware: Tesla V100-SXM2, running in Docker
• Network Type: Detectnet_v2

Which version of docker did you use?
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
or
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 ?

this one: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

For detectnet_v2 and faster_rcnn, you need to use nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3.

Usually end users run TAO via tao-launcher.
That means, if you run “tao info --verbose”, or run the detectnet_v2 network via “tao detectnet_v2 xxx”, the detectnet_v2 network will automatically run with the 1.15.4 version.
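
If you start the container by hand instead of going through tao-launcher, a small guard can catch this mismatch before training starts. This is only a sketch under assumptions: the helper `image_mismatch` and the constant names are hypothetical and not part of TAO, and the mapping covers only the two networks named above (other networks are left unchecked).

```python
# Image recommended in this thread for detectnet_v2 and faster_rcnn.
TF1154_IMAGE = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3"

# Only the networks mentioned in this thread; anything else is not checked.
NETWORKS_NEEDING_TF1154 = {"detectnet_v2", "faster_rcnn"}


def image_mismatch(network, image):
    """Return a warning string if `image` is not the one recommended here
    for `network`, or None if there is nothing to report."""
    if network.lower() in NETWORKS_NEEDING_TF1154 and image != TF1154_IMAGE:
        return f"{network} should run in {TF1154_IMAGE}, not {image}"
    return None


if __name__ == "__main__":
    # The combination reported earlier in this thread.
    used = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3"
    print(image_mismatch("detectnet_v2", used))
```

Returning None for unlisted networks is deliberate: the thread only states the requirement for detectnet_v2 and faster_rcnn, so the sketch does not guess at the others.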


This seems to have solved things. Thanks.
