While invoking the TAO container directly, getting error: tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

• Hardware: T4
• Network Type: Detectnet_v2
• TLT Version: nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
• How to reproduce the issue?
Invoke the container directly, as described here: Link
detectnet_v2 evaluate worked fine.

docker run -it --rm --gpus all \
  -v /home/ubuntu/TAO:/workspace \
  -v /home/ubuntu/TAO/tlt-experiments:/workspace/tao-experiments \
  -v /home/ubuntu/TAO/cv_samples_v1.2.0/facenet/specs:/workspace/tao-experiments/facenet/specs \
  $DOCKER_REGISTRY/$DOCKER_NAME:$DOCKER_TAG \
  detectnet_v2 train -e /workspace/cv_samples_v1.2.0/facenet/specs/facenet_train_resnet18_kitti.txt \
  -r /workspace/results1 -k nvidia_tlt

When I run detectnet_v2 train the same way, I get the error below:

tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

INFO:tensorflow:Graph was finalized.
2022-03-08 08:32:18,150 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-03-08 08:32:20,061 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-03-08 08:32:20,752 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-03-08 08:32:26,689 [INFO] tensorflow: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.0019400244, step = 0
2022-03-08 08:33:05,784 [INFO] tensorflow: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.0019400244, step = 0
2022-03-08 08:33:05,787 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 0/100: loss: 0.00194 learning rate: 0.00000 Time taken: 0:00:00 ETA: 0:00:00
2022-03-08 08:33:05,787 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.568
INFO:tensorflow:epoch = 0.012406947890818858, learning_rate = 5.0286485e-06, loss = 0.0029743705, step = 5 (6.248 sec)
2022-03-08 08:33:12,032 [INFO] tensorflow: epoch = 0.012406947890818858, learning_rate = 5.0286485e-06, loss = 0.0029743705, step = 5 (6.248 sec)
INFO:tensorflow:epoch = 0.02977667493796526, learning_rate = 5.069036e-06, loss = 0.0015561958, step = 12 (5.919 sec)
2022-03-08 08:33:17,951 [INFO] tensorflow: epoch = 0.02977667493796526, learning_rate = 5.069036e-06, loss = 0.0015561958, step = 12 (5.919 sec)
INFO:tensorflow:epoch = 0.04714640198511166, learning_rate = 5.109743e-06, loss = 0.0038140405, step = 19 (5.992 sec)
2022-03-08 08:33:23,943 [INFO] tensorflow: epoch = 0.04714640198511166, learning_rate = 5.109743e-06, loss = 0.0038140405, step = 19 (5.992 sec)
2022-03-08 08:33:24.255061: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[50314529bd0d:00029] *** Process received signal ***
[50314529bd0d:00029] Signal: Aborted (6)
[50314529bd0d:00029] Signal code:  (-6)
[50314529bd0d:00029] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f3c281f1040]
[50314529bd0d:00029] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3c281f0fb7]
[50314529bd0d:00029] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3c281f2921]
[50314529bd0d:00029] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x85fa784)[0x7f3bcf1c1784]
[50314529bd0d:00029] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f3bceb76507]
[50314529bd0d:00029] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f3bceb76d9f]
[50314529bd0d:00029] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f3bc5e73fa1]
[50314529bd0d:00029] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f3bc5e71608]
[50314529bd0d:00029] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f3c260db6df]
[50314529bd0d:00029] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f3c27f9a6db]
[50314529bd0d:00029] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f3c282d371f]
[50314529bd0d:00029] *** End of error message ***
Aborted (core dumped)

Usually, users trigger detectnet_v2 training via the launcher command “tao detectnet_v2 xxx”.
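
For reference, a minimal sketch of the equivalent launcher invocation, assuming the host directories used above are mapped in your ~/.tao_mounts.json so that the in-container spec and results paths from the original command still resolve:

# spec path, results dir, and key are carried over from the original docker run command
tao detectnet_v2 train -e /workspace/cv_samples_v1.2.0/facenet/specs/facenet_train_resnet18_kitti.txt \
    -r /workspace/results1 -k nvidia_tlt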

You are triggering it with “docker run xxx”. For this way of running detectnet_v2, please use the nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 docker.
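
As an example, the same docker run as above pointed at the suggested tf1.15.4 image (a sketch only: the registry value nvcr.io is an assumption, and the mounts, spec path, and key are simply carried over from the original command):

export DOCKER_REGISTRY=nvcr.io
export DOCKER_NAME=nvidia/tao/tao-toolkit-tf
export DOCKER_TAG=v3.21.11-tf1.15.4-py3   # tf1.15.4 image instead of tf1.15.5
docker run -it --rm --gpus all \
  -v /home/ubuntu/TAO:/workspace \
  -v /home/ubuntu/TAO/tlt-experiments:/workspace/tao-experiments \
  -v /home/ubuntu/TAO/cv_samples_v1.2.0/facenet/specs:/workspace/tao-experiments/facenet/specs \
  $DOCKER_REGISTRY/$DOCKER_NAME:$DOCKER_TAG \
  detectnet_v2 train -e /workspace/cv_samples_v1.2.0/facenet/specs/facenet_train_resnet18_kitti.txt \
  -r /workspace/results1 -k nvidia_tlt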

You can also see more info via “tao info --verbose”.
