Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type = Yolo_v4
• Training spec file (if you have one, please share it here)
Hello, I get a segmentation fault when I try to retrain the yolo_v4 model after pruning it. Both training processes crash at the start of the first epoch:
INFO: Starting Training Loop.
Epoch 1/200
[201af7c0eddd:107 :0:708] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[201af7c0eddd:108 :0:682] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 682) ====
0 0x00000000000153c0 __funlockfile() ???:0
1 0x000000000749abb0 tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::minimum<float> >::Compute() ???:0
2 0x00000000010f2333 tensorflow::BaseGPUDevice::Compute() ???:0
3 0x00000000011500b7 tensorflow::(anonymous namespace)::ExecutorState::Process() executor.cc:0
4 0x0000000001150723 std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke() executor.cc:0
5 0x0000000001205e6d Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop() ???:0
6 0x000000000120297c std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke() ???:0
7 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
8 0x0000000000009609 start_thread() ???:0
9 0x0000000000122293 clone() ???:0
=================================
==== backtrace (tid: 708) ====
0 0x00000000000153c0 __funlockfile() ???:0
1 0x0000000007443ae0 tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::maximum<float> >::Compute() ???:0
2 0x00000000010f2333 tensorflow::BaseGPUDevice::Compute() ???:0
3 0x00000000011500b7 tensorflow::(anonymous namespace)::ExecutorState::Process() executor.cc:0
4 0x0000000001150723 std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke() executor.cc:0
5 0x0000000001205e6d Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop() ???:0
6 0x000000000120297c std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke() ???:0
7 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
8 0x0000000000009609 start_thread() ???:0
9 0x0000000000122293 clone() ???:0
=================================
[201af7c0eddd:00108] *** Process received signal ***
[201af7c0eddd:00108] Signal: Segmentation fault (11)
[201af7c0eddd:00108] Signal code: (-6)
[201af7c0eddd:00108] Failing at address: 0x6c
[201af7c0eddd:00107] *** Process received signal ***
[201af7c0eddd:00107] Signal: Segmentation fault (11)
[201af7c0eddd:00107] Signal code: (-6)
[201af7c0eddd:00107] Failing at address: 0x6b
[201af7c0eddd:00108] [ 0] [201af7c0eddd:00107] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f0df2b053c0]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fe75258c3c0]
[201af7c0eddd:00108] [ 1] [201af7c0eddd:00107] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor7maximumIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7f0d5e25bae0]
[201af7c0eddd:00107] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3d3)[0x7f0ddca60333]
[201af7c0eddd:00107] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x11500b7)[0x7f0ddcabe0b7]
[201af7c0eddd:00107] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1150723)[0x7f0ddcabe723]
[201af7c0eddd:00107] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7f0ddcb73e6d]
[201af7c0eddd:00107] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7f0ddcb7097c]
[201af7c0eddd:00107] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f0de743ade4]
[201af7c0eddd:00107] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f0df2af9609]
[201af7c0eddd:00107] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f0df2c35293]
[201af7c0eddd:00107] *** End of error message ***
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor7minimumIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7fe6bdd39bb0]
[201af7c0eddd:00108] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3d3)[0x7fe73ece8333]
[201af7c0eddd:00108] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x11500b7)[0x7fe73ed460b7]
[201af7c0eddd:00108] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1150723)[0x7fe73ed46723]
[201af7c0eddd:00108] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7fe73edfbe6d]
[201af7c0eddd:00108] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7fe73edf897c]
[201af7c0eddd:00108] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fe746ec1de4]
[201af7c0eddd:00108] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fe752580609]
[201af7c0eddd:00108] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fe7526bc293]
[201af7c0eddd:00108] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 201af7c0eddd exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2022-10-26 12:51:35,756 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
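To make the two backtraces above easier to compare, here is a minimal Python sketch that pulls out which TensorFlow functor each crashing thread was executing. The helper name `crashing_functors` and the regular expressions are my own, for illustration only. The sample text is copied from the first frames of the backtraces in this post. It only summarizes the log and does not explain the cause of the crash.

```python
import re

# First frames of the two backtraces, copied from the log above.
LOG = """\
==== backtrace (tid: 682) ====
0 0x00000000000153c0 __funlockfile() ???:0
1 0x000000000749abb0 tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::minimum<float> >::Compute() ???:0
==== backtrace (tid: 708) ====
0 0x00000000000153c0 __funlockfile() ???:0
1 0x0000000007443ae0 tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::maximum<float> >::Compute() ???:0
"""

# Hypothetical patterns written for this particular log layout.
TID_RE = re.compile(r"==== backtrace \(tid:\s*(\d+)\) ====")
FUNCTOR_RE = re.compile(r"tensorflow::functor::(\w+)<")


def crashing_functors(log):
    """Map each thread id to the first tensorflow::functor name in its backtrace."""
    result = {}
    tid = None
    for line in log.splitlines():
        header = TID_RE.search(line)
        if header:
            # A new backtrace section starts; remember its thread id.
            tid = int(header.group(1))
            continue
        functor = FUNCTOR_RE.search(line)
        if functor and tid is not None and tid not in result:
            # Keep only the innermost (first) functor frame per thread.
            result[tid] = functor.group(1)
    return result


print(crashing_functors(LOG))
```

On this excerpt it reports `minimum` for tid 682 and `maximum` for tid 708, so both processes fail inside element-wise GPU binary ops right after `Epoch 1/200`.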