Error when trying to retrain yolo_v4

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type = Yolo_v4
• Training spec file (if you have one, please share it here)
Hello, I get an error when I try to retrain the yolo_v4 model after pruning it.

INFO: Starting Training Loop.
Epoch 1/200
[201af7c0eddd:107  :0:708] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[201af7c0eddd:108  :0:682] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:    682) ====
 0 0x00000000000153c0 __funlockfile()  ???:0
 1 0x000000000749abb0 tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::minimum<float> >::Compute()  ???:0
 2 0x00000000010f2333 tensorflow::BaseGPUDevice::Compute()  ???:0
 3 0x00000000011500b7 tensorflow::(anonymous namespace)::ExecutorState::Process()  executor.cc:0
 4 0x0000000001150723 std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke()  executor.cc:0
 5 0x0000000001205e6d Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop()  ???:0
 6 0x000000000120297c std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke()  ???:0
 7 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
 8 0x0000000000009609 start_thread()  ???:0
 9 0x0000000000122293 clone()  ???:0
=================================
==== backtrace (tid:    708) ====
 0 0x00000000000153c0 __funlockfile()  ???:0
 1 0x0000000007443ae0 tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::maximum<float> >::Compute()  ???:0
 2 0x00000000010f2333 tensorflow::BaseGPUDevice::Compute()  ???:0
 3 0x00000000011500b7 tensorflow::(anonymous namespace)::ExecutorState::Process()  executor.cc:0
 4 0x0000000001150723 std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke()  executor.cc:0
 5 0x0000000001205e6d Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop()  ???:0
 6 0x000000000120297c std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke()  ???:0
 7 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
 8 0x0000000000009609 start_thread()  ???:0
 9 0x0000000000122293 clone()  ???:0
=================================
[201af7c0eddd:00108] *** Process received signal ***
[201af7c0eddd:00108] Signal: Segmentation fault (11)
[201af7c0eddd:00108] Signal code:  (-6)
[201af7c0eddd:00108] Failing at address: 0x6c
[201af7c0eddd:00107] *** Process received signal ***
[201af7c0eddd:00107] Signal: Segmentation fault (11)
[201af7c0eddd:00107] Signal code:  (-6)
[201af7c0eddd:00107] Failing at address: 0x6b
[201af7c0eddd:00108] [ 0] [201af7c0eddd:00107] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f0df2b053c0]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fe75258c3c0]
[201af7c0eddd:00108] [ 1] [201af7c0eddd:00107] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor7maximumIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7f0d5e25bae0]
[201af7c0eddd:00107] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3d3)[0x7f0ddca60333]
[201af7c0eddd:00107] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x11500b7)[0x7f0ddcabe0b7]
[201af7c0eddd:00107] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1150723)[0x7f0ddcabe723]
[201af7c0eddd:00107] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7f0ddcb73e6d]
[201af7c0eddd:00107] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7f0ddcb7097c]
[201af7c0eddd:00107] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f0de743ade4]
[201af7c0eddd:00107] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f0df2af9609]
[201af7c0eddd:00107] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f0df2c35293]
[201af7c0eddd:00107] *** End of error message ***
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor7minimumIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7fe6bdd39bb0]
[201af7c0eddd:00108] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3d3)[0x7fe73ece8333]
[201af7c0eddd:00108] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x11500b7)[0x7fe73ed460b7]
[201af7c0eddd:00108] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1150723)[0x7fe73ed46723]
[201af7c0eddd:00108] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7fe73edfbe6d]
[201af7c0eddd:00108] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7fe73edf897c]
[201af7c0eddd:00108] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fe746ec1de4]
[201af7c0eddd:00108] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fe752580609]
[201af7c0eddd:00108] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fe7526bc293]
[201af7c0eddd:00108] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 201af7c0eddd exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2022-10-26 12:51:35,756 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you share the full training command?

Retraining using the pruned model as pretrained weights:

!tao yolo_v4 train --gpus 4 \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_retrain \
    -k $KEY

Can you try 1 GPU?
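
For reference, a minimal sketch of the same command limited to a single GPU (spec file, result directory, and key unchanged from the command above):

!tao yolo_v4 train --gpus 1 \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_retrain \
    -k $KEY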

Now I have this:

INFO: Starting Training Loop.
Epoch 1/200
INFO: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,29484,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node loss/encoded_detections_loss/sub_19}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_19381]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,29484,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node loss/encoded_detections_loss/sub_19}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
ERROR: Ran out of GPU memory, please lower the batch size, use a smaller input resolution, use a smaller backbone, or enable model parallelism for supported TLT architectures (see TLT documentation).
2022-10-27 12:40:48,224 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The problem is that I trained for 200 epochs with batch size 32 before pruning, and it worked.

You can try a smaller batch size.
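
As a rough sketch (the field names assume the standard TAO YOLOv4 retrain spec; adjust them to match your yolo_v4_retrain_resnet18_kitti.txt), lowering the batch size means editing training_config in the spec file, for example:

training_config {
  batch_size_per_gpu: 8   # was 32; reduce until the OOM disappears
  num_epochs: 200
  # keep the rest of your existing training_config unchanged
}

If a smaller batch size alone is not enough, the OOM message above also suggests a smaller input resolution, which in this spec would normally be output_width/output_height in augmentation_config.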

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

What is the latest status with 1 GPU and with 4 GPUs?

Is the 4-GPU training OK now? If not, please share the output of nvidia-smi and dmesg.
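
For example, something like the following captures both outputs so they can be attached here (the log file names are only placeholders):

# GPU utilization/memory while (or right after) the 4-GPU run fails
nvidia-smi > nvidia-smi.log

# kernel messages, useful for spotting OOM-killer or driver/Xid errors (may need sudo)
dmesg > dmesg.log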

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.