Training Crashing with ClassificationImagePipeline

Hi,

Training crashes with ClassificationImagePipeline, while it works fine with ClassificationImagePipelineWithCache.

For reference, I have attached the config_train file along with the custom data loader that was written.

config_train.json (6.7 KB)
clara_data_loader.py (1.9 KB)

> This epoch: 29.89s; per epoch: 38.74s; elapsed: 154.95s; remaining: 38583.79s; best metric: 0.8047948862693924 at epoch 1
> [ENG-AI8-DDC1:2449 :0:2495] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fdecffdc000)
> [ENG-AI8-DDC1:2449 :1:2498] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f)
> [ENG-AI8-DDC1:2449 :2:2494] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
> ==== backtrace ====
> ==== backtrace ====
> ==== backtrace ====
>     0  /usr/local/ucx/lib/libucs.so.0(+0x22d7c) [0x7fe093756d7c]
>     1  /usr/local/ucx/lib/libucs.so.0(+0x22ff4) [0x7fe093756ff4]
>     2  /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197) [0x7fe1104f9277]
>     3  /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x18) [0x7fe09ebd5298]
>     4  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x80032a5) [0x7fe0af82a2a5]
>     5  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xfee824) [0x7fe0a6b08824]
>     6  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf37622) [0x7fe0a6a51622]
>     7  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x13a) [0x7fe0b06ab0fa]
>     8  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f) [0x7fe0b06aba4f]
>     9  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281) [0x7fe0a6b58941]
>    10  /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48) [0x7fe0a6b55fa8]
>    11  /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fe09ebff6df]
>    12  /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe11024a6db]
>    13  /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe110583a3f]
> ===================
> ./train.sh: line 20:  2449 Segmentation fault      (core dumped) python -u -m nvmidl.apps.train -m $MMAR_ROOT -c $CONFIG_FILE -e $ENVIRONMENT_FILE --set epochs=1000


The backtrace shows /usr/local/ucx, and there appear to be three processes (2494, 2495, 2498). May I know whether you were running multi-GPU training with mpirun and UCX? Our recommended way to run multi-GPU training is with this command:
mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --allow-run-as-root \
python3 -u -m nvmidl.apps.train
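If you do move to multi-GPU later, the full invocation simply carries over the MMAR arguments from your existing train.sh. This is only a sketch, assuming the same environment variables are set in your script and that your config exposes a multi_gpu flag (drop that override if it does not):

# sketch only: $MMAR_ROOT, $CONFIG_FILE, $ENVIRONMENT_FILE as in train.sh; the multi_gpu=true override is an assumption
mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --allow-run-as-root \
python3 -u -m nvmidl.apps.train -m $MMAR_ROOT -c $CONFIG_FILE -e $ENVIRONMENT_FILE --set epochs=1000 multi_gpu=true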

We have released Clara Train v4.0, so Clara Train 3.1 is our last release based on TensorFlow and is now in maintenance mode. If you can migrate to PyTorch, please upgrade to the latest Clara Train 4.0, which is based on the open-source MONAI framework.

Thanks for your reply.

I am using a single GPU for this operation.

I have attached a detailed summary of the error.
Based on my experiments, the error occurs whenever the number of workers is > 1 and prefetch is > 1.
When both are set to 1, training does not crash.
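For context, the two settings I am varying live in the image_pipeline section of config_train.json. The snippet below is only a schematic of that section; the surrounding field names and placeholder values are illustrative, not copied verbatim from my attached config:

"image_pipeline": {
  "name": "ClassificationImagePipeline",
  "args": {
    "data_list_file_path": "{DATASET_JSON}",
    "data_file_base_dir": "{DATA_ROOT}",
    "data_list_key": "training",
    "output_batch_size": 32,
    "num_workers": 1,
    "prefetch_size": 1
  }
}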

Please let me know if there is a fix for this.

Fatal Python error: GC object already tracked


Thread 0x00007f5663fff700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait


Thread 0x00007f566ffff700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 415 in _handle_tasks
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr[ENG-AI8-DDC1:732  :0:787] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
malloc(): memory corruption
/lib/python3.6/threadi[ENG-AI8-DDC1:00732] *** Process received signal ***
[ENG-AI8-DDC1:00732] Signal: Aborted (6)
[ENG-AI8-DDC1:00732] Signal code:  (-6)
ng.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap


Thread 0x00007f567bfff700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap


Thread 0x00007f5687fff700 (most recent call first):


Current thread 0x00007f5693fff700 (most recent call first):


Thread 0x00007f569ffff700 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/SimpleITK/SimpleITK.py", line 8614 in ReadImage


Thread 0x00007f56abfff700 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/SimpleITK/Simple./train.sh: line 20:   732 Segmentation fault      (core dumped) python -u -m nvmidl.apps.train -m $MMAR_ROOT -c $CONFIG_FILE -e $ENVIRONMENT_FILE --set epochs=1000

Update:

It crashed with the number of workers set to 1 as well.

Epoch: 56/1000, Iter: 1347/1347 [====================]  train_accuracy: 0.9995  train_loss: 0.0009  time: 1.55s
This epoch: 2122.78s; per epoch: 2140.96s; elapsed: 119893.78s; remaining: 2021066.61s; best metric: 0.8224931838495732 at epoch 1
[ENG-AI8-DDC1:1152 :0:1668] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
./train.sh: line 20:  1152 Segmentation fault      (core dumped) python -u -m nvmidl.apps.train -m $MMAR_ROOT -c $CONFIG_FILE -e $ENVIRONMENT_FILE --set epochs=1000

Hi Nvidia Team,
I am working with @araj. We want to use Clara 3.1 for some time yet; moving to PyTorch is not an option right now. It would be great if some help could be provided. More importantly, it would also give us the confidence to move to 4.0 in the future. As of now, even running the examples throws surprises.
I hope I have conveyed my concern.
Best regards,
Krishnan