Hi,
For the past week I have been trying to train detectnet_v2 with a resnet18 backbone (downloaded from NGC) on my dataset. About four times now, the training has stopped partway through the run. I specified 200 epochs in the config. In each case the machine restarted and I could not note down the error. I am running the latest docker container downloaded from NGC.
System specs: GPU - 2080 Ti RTX (11 GB RAM), Ubuntu 18.04, CUDA 10.1, Driver 430.50, TLT: Docker image streamanalytics v2.0, Detectnet_v2.
Dataset: 55000 images with annotations, Train:Val = 80:20.
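For reference, here is a rough sketch of what these numbers work out to per epoch for the two batch sizes I tried (32 and 16). It assumes a single GPU and that a final partial batch counts as one step. The helper name `steps_per_epoch` is just something I made up for this post, not part of TLT.

```python
import math


def steps_per_epoch(num_images, train_fraction, batch_size):
    """Return (train_images, val_images, steps) for one epoch.

    Assumption: a trailing partial batch still counts as one step.
    """
    train_images = int(round(num_images * train_fraction))
    val_images = num_images - train_images
    steps = int(math.ceil(train_images / float(batch_size)))
    return train_images, val_images, steps


NUM_IMAGES = 55000      # dataset size from this post
TRAIN_FRACTION = 0.8    # Train:Val = 80:20
EPOCHS = 200            # value set in my config

for batch_size in (32, 16):
    train_images, val_images, steps = steps_per_epoch(
        NUM_IMAGES, TRAIN_FRACTION, batch_size
    )
    print(
        "batch %d: %d train / %d val, %d steps per epoch, %d steps over %d epochs"
        % (batch_size, train_images, val_images, steps, steps * EPOCHS, EPOCHS)
    )
```

This is only to help map a step number in the logs back to an epoch.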
I have two queries:
(a) Why did tlt-train end abruptly, and why did the machine restart? I am using a batch size of 32. I also tried a batch size of 16, and the machine still restarted after a few epochs.
(b) Contrary to what is stated in the release, training did not resume from the checkpoint where it was interrupted. I made no changes to the spec files and simply ran the same command, but training started again from epoch 0. So resuming from the previous checkpoint is not happening automatically. Is there an option that needs to be set explicitly in the spec files?
For the last failure, I was able to capture the error. It is pasted below.
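To rule out the possibility that no checkpoint was written at all, something like the sketch below could be used to find the newest checkpoint in the results directory. It assumes the checkpoint file names contain `step-<number>`, which is my guess at the naming and may not match every TLT version. The directory path and the function name `latest_checkpoint` are placeholders of my own.

```python
import os
import re

# Assumption: checkpoint file names contain "step-<number>" somewhere.
STEP_PATTERN = re.compile(r"step-(\d+)")


def latest_checkpoint(results_dir):
    """Return (step, filename) for the highest step found, or None."""
    best = None
    for name in os.listdir(results_dir):
        match = STEP_PATTERN.search(name)
        if match is None:
            continue  # not a checkpoint-like file
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, name)
    return best


if __name__ == "__main__":
    # Hypothetical path: replace with the results directory passed to tlt-train.
    results_dir = "/workspace/results"
    if os.path.isdir(results_dir):
        print(latest_checkpoint(results_dir))
    else:
        print("results directory not found: %s" % results_dir)
```

If this prints `None` after a crash, there is nothing to resume from, which would at least explain the restart from epoch 0.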
[e1e92165c1ab:01551] Signal: Aborted (6)
[e1e92165c1ab:01551] Signal code: (-6)
[e1e92165c1ab:01551] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fb7ce4c4390]
[e1e92165c1ab:01551] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fb7ce11e428]
[e1e92165c1ab:01551] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fb7ce12002a]
[e1e92165c1ab:01551] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x777ea)[0x7fb7ce1607ea]
[e1e92165c1ab:01551] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fb7ce16937a]
[e1e92165c1ab:01551] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fb7ce16d53c]
[e1e92165c1ab:01551] [ 6] /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZNSt14_Function_base13_Base_managerIZN10tensorflow10WhereGPUOpIbE16ComputeAsyncTypeIiEEvRKNS1_6TensorEiPNS1_15OpKernelContextESt8functionIFvvEEEUlvE_E10_M_managerERSt9_Any_dataRKSF_St18_Manager_operation+0x7d)[0x7fb77ac6c50d]
[e1e92165c1ab:01551] [ 7] /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr11ThenExecuteEPN15stream_executor6StreamESt8functionIFvvEE+0x14c)[0x7fb77a72ef3c]
[e1e92165c1ab:01551] [ 8] /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10WhereGPUOpIbE16ComputeAsyncTypeIiEEvRKNS_6TensorEiPNS_15OpKernelContextESt8functionIFvvEE+0x744)[0x7fb77ac83554]
[e1e92165c1ab:01551] [ 9] /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10WhereGPUOpIbE12ComputeAsyncEPNS_15OpKernelContextESt8functionIFvvEE+0x9a)[0x7fb77ac8659a]
[e1e92165c1ab:01551] [10] /usr/local/lib/python2.7/dist-packages/tensorflow/python/…/libtensorflow_framework.so(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0x184)[0x7fb776f6da04]
[e1e92165c1ab:01551] [11] /usr/local/lib/python2.7/dist-packages/tensorflow/python/…/libtensorflow_framework.so(+0x722d5d)[0x7fb776fb8d5d]
[e1e92165c1ab:01551] [12] /usr/local/lib/python2.7/dist-packages/tensorflow/python/…/libtensorflow_framework.so(+0x72374a)[0x7fb776fb974a]
[e1e92165c1ab:01551] [13] /usr/local/lib/python2.7/dist-packages/tensorflow/python/…/libtensorflow_framework.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x306)[0x7fb77702adc6]
[e1e92165c1ab:01551] [14] /usr/local/lib/python2.7/dist-packages/tensorflow/python/…/libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x44)[0x7fb777029c84]
[e1e92165c1ab:01551] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fb7692abc80]
[e1e92165c1ab:01551] [16] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fb7ce4ba6ba]
[e1e92165c1ab:01551] [17] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb7ce1f041d]
[e1e92165c1ab:01551] *** End of error message ***
/usr/local/bin/tlt-train: line 32: 1551 Aborted (core dumped) tlt-train-g1 ${PYTHON_ARGS[*]}
I followed the solution suggested online (on the TensorFlow GitHub) and ran the following:
sudo apt-get install libtcmalloc-minimal4
sudo apt-get install google-perftools
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
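To confirm that the preload setting actually points at a real file, a small check like this could be run in the same shell (or container) that launches tlt-train. It is only a sketch: the function name `check_preload` is my own, and the location of libtcmalloc can differ between distributions, so a "MISSING" result may just mean the library lives in another directory.

```python
import os
import re


def check_preload(value, exists=os.path.exists):
    """Split an LD_PRELOAD value and report whether each entry exists.

    LD_PRELOAD entries may be separated by spaces or colons.
    """
    entries = [entry for entry in re.split(r"[\s:]+", value) if entry]
    return [(entry, exists(entry)) for entry in entries]


if __name__ == "__main__":
    results = check_preload(os.environ.get("LD_PRELOAD", ""))
    if not results:
        print("LD_PRELOAD is empty or not set in this environment")
    for path, found in results:
        print("%s -> %s" % (path, "found" if found else "MISSING"))
```

Note that a variable exported on the host is not automatically visible inside a docker container, so the check is only meaningful where the training process itself runs.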
Any response would be very helpful.