YOLO V4 not training

bhargavi.sanadhya · June 17, 2021, 5:01pm

While training YOLO v4 on 1 GPU, I get the following error:

[ef112f939b52:56258] *** Process received signal ***
[ef112f939b52:56258] Signal: Segmentation fault (11)
[ef112f939b52:56258] Signal code: Address not mapped (1)
[ef112f939b52:56258] Failing at address: 0x10
[ef112f939b52:56258] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f70577c4040]
[ef112f939b52:56258] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor3addIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7f6f6a50ff90]
[ef112f939b52:56258] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x522)[0x7f6f642fb382]
[ef112f939b52:56258] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf978ab)[0x7f6f6435c8ab]
[ef112f939b52:56258] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf97c6f)[0x7f6f6435cc6f]
[ef112f939b52:56258] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f6f6440c791]
[ef112f939b52:56258] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f6f64409df8]
[ef112f939b52:56258] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f70556c86df]
[ef112f939b52:56258] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f705756d6db]
[ef112f939b52:56258] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f70578a671f]
[ef112f939b52:56258] *** End of error message ***
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

NVES · June 17, 2021, 5:07pm

Hi,
We recommend checking the sample links below, as they might answer your question.

If the issue persists, please share the model and script so that we can try to reproduce it on our end.
Thanks!

spolisetty · June 18, 2021, 12:28pm

Hi @bhargavi.sanadhya,

Please share more details. Based on the information you've provided, this doesn't look like a TensorRT-related issue.
We recommend posting your question on the relevant platform to get better help.

Thank you.