Hello,
My problem is a continuation of CLI update - #12 by Morganh
When I start training, it crashes at a random epoch (I have tried twice).
!tao faster_rcnn train --gpu_index $GPU_INDEX -e $SPECS_DIR/default_spec_resnet18.txt
Epoch 218/300
48/52 [==========================>...] - ETA: 2s - loss: 0.5245 - rpn_out_class_loss: 0.0188 - rpn_out_regress_loss: 0.0085 - dense_class_td_loss: 0.0536 - dense_regress_td_loss: 0.0534
2022-06-23 16:41:22.433076: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[72cd7c1307dc:00054] *** Process received signal ***
[72cd7c1307dc:00054] Signal: Aborted (6)
[72cd7c1307dc:00054] Signal code: (-6)
[72cd7c1307dc:00054] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f3f4842c210]
[72cd7c1307dc:00054] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f3f4842c18b]
[72cd7c1307dc:00054] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f3f4840b859]
[72cd7c1307dc:00054] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xc1b1788)[0x7f3eec1ba788]
[72cd7c1307dc:00054] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x235cb2a)[0x7f3ee2365b2a]
[72cd7c1307dc:00054] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0xbb)[0x7f3ee9d612db]
[72cd7c1307dc:00054] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7f3edf41ce6d]
[72cd7c1307dc:00054] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7f3edf41997c]
[72cd7c1307dc:00054] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f3f4774fde4]
[72cd7c1307dc:00054] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f3f483cc609]
[72cd7c1307dc:00054] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f3f48508293]
[72cd7c1307dc:00054] *** End of error message ***
2022-06-23 18:41:23,295 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    14W / 170W |    223MiB / 12288MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       918      G   /usr/lib/xorg/Xorg                 94MiB |
|    0   N/A  N/A      1227      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A      3579      G   /usr/lib/firefox/firefox          102MiB |
+-----------------------------------------------------------------------------+
tao info --verbose
Configuration of the TAO Toolkit Instance
dockers:
    nvidia/tao/tao-toolkit-tf:
        v3.22.05-tf1.15.5-py3:
            docker_registry: nvcr.io
            tasks:
                1. augment
                2. bpnet
                3. classification
                4. dssd
                5. faster_rcnn
                6. emotionnet
                7. efficientdet
                8. fpenet
                9. gazenet
                10. gesturenet
                11. heartratenet
                12. lprnet
                13. mask_rcnn
                14. multitask_classification
                15. retinanet
                16. ssd
                17. unet
                18. yolo_v3
                19. yolo_v4
                20. yolo_v4_tiny
                21. converter
        v3.22.05-tf1.15.4-py3:
            docker_registry: nvcr.io
            tasks:
                1. detectnet_v2
    nvidia/tao/tao-toolkit-pyt:
        v3.22.05-py3:
            docker_registry: nvcr.io
            tasks:
                1. speech_to_text
                2. speech_to_text_citrinet
                3. speech_to_text_conformer
                4. action_recognition
                5. pointpillars
                6. pose_classification
                7. spectro_gen
                8. vocoder
        v3.21.11-py3:
            docker_registry: nvcr.io
            tasks:
                1. text_classification
                2. question_answering
                3. token_classification
                4. intent_slot_classification
                5. punctuation_and_capitalization
    nvidia/tao/tao-toolkit-lm:
        v3.22.05-py3:
            docker_registry: nvcr.io
            tasks:
                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022
Do you have any idea what is causing this crash, and what the error code “6” refers to?
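As a side note, the number in "Signal: Aborted (6)" can be mapped to a signal name with Python's standard signal module. The snippet below is only a minimal sketch, assuming standard Linux signal numbering inside the container; describe_signal is a hypothetical helper name, not part of TAO or TensorFlow. On Linux it reports SIGABRT, which only says the process called abort(), not why the GPU event failed.

```python
import signal

# Signal number copied from the crash log line "Signal: Aborted (6)".
SIGNAL_NUMBER = 6


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    # signal.Signals is an enum of the signals known to this platform;
    # an unknown number raises ValueError.
    return signal.Signals(number).name


if __name__ == "__main__":
    print(describe_signal(SIGNAL_NUMBER))
```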
Thank you for your help.