Hello,
I'm using TLT within Windows 10 WSL2 (Ubuntu 20.04). I’ve successfully installed CUDA, Docker, and TLT 3.0 on WSL.
I was following the sample guide for Mask-RCNN, but I always get stuck at step 4 (Training), where I get the following output:
[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…
[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tlt-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
[23a3b5ac81fc:00077] *** Process received signal ***
[23a3b5ac81fc:00077] Signal: Segmentation fault (11)
[23a3b5ac81fc:00077] Signal code: Address not mapped (1)
[23a3b5ac81fc:00077] Failing at address: 0x10
[23a3b5ac81fc:00077] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7ff7ef71c040]
[23a3b5ac81fc:00077] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen16ThreadPoolDeviceENS_7functor7maximumIfEEE7ComputeEPNS_15OpKernelContextE+0x148)[0x7ff72cbce338]
[23a3b5ac81fc:00077] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf97ab2)[0x7ff726407ab2]
[23a3b5ac81fc:00077] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf853f5)[0x7ff7263f53f5]
[23a3b5ac81fc:00077] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x7282ed1)[0x7ff72e408ed1]
[23a3b5ac81fc:00077] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7ff7264b7791]
[23a3b5ac81fc:00077] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7ff7264b4df8]
[23a3b5ac81fc:00077] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7ff7eca816df]
[23a3b5ac81fc:00077] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7ff7ef4c56db]
[23a3b5ac81fc:00077] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ff7ef7fe71f]
[23a3b5ac81fc:00077] *** End of error message ***
Segmentation fault
2021-06-24 21:14:15,579 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
The program hangs at “Saving checkpoints for 0” for several minutes and then prints the segmentation fault shown above.
My ~/.tlt_mounts.json looks like this:
{
    "Mounts": [
        {
            "source": "/home/clk/tlt",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/home/user/tlt/data",
            "destination": "/workspace/tlt-experiments/data"
        },
        {
            "source": "/home/user/tlt/mask_rcnn/experiment_dir_unpruned",
            "destination": "/workspace/tlt-experiments/results"
        },
        {
            "source": "/home/user/tlt/mask_rcnn/specs",
            "destination": "/workspace/tlt-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
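To rule out a mount problem, a file like the one above can be sanity-checked from the WSL shell. The following is only a minimal sketch, not part of TLT: the helper names check_mounts and check_mounts_file are hypothetical, and it assumes nothing beyond the "Mounts", "source" and "destination" keys shown above. It reports source directories that do not exist on the host and destinations that are listed twice; it does not say anything about the segmentation fault itself.

```python
import json
import os


def check_mounts(config):
    """Return a list of human-readable problems found in a mounts config dict.

    Hypothetical helper: it only inspects the "Mounts" list and its
    "source"/"destination" keys, as they appear in ~/.tlt_mounts.json.
    """
    problems = []
    seen_destinations = set()
    for index, mount in enumerate(config.get("Mounts", [])):
        source = mount.get("source")
        destination = mount.get("destination")
        if not source or not destination:
            problems.append(f"mount {index}: missing 'source' or 'destination'")
            continue
        # The source must be an existing directory on the host (here: inside WSL).
        if not os.path.isdir(os.path.expanduser(source)):
            problems.append(f"mount {index}: source directory not found: {source}")
        # Two mounts pointing at the same container path shadow each other.
        if destination in seen_destinations:
            problems.append(f"mount {index}: duplicate destination: {destination}")
        seen_destinations.add(destination)
    return problems


def check_mounts_file(path="~/.tlt_mounts.json"):
    """Load the mounts file from disk and run check_mounts on it."""
    with open(os.path.expanduser(path), encoding="utf-8") as handle:
        return check_mounts(json.load(handle))
```

Calling check_mounts_file() with no arguments reads ~/.tlt_mounts.json and returns a list of problems; an empty list means every source directory exists and no destination is repeated.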
Further information:
WSL: 5.4.72-microsoft-standard-WSL2
Distro: Ubuntu 20.04
TLT Version 3.0
docker_tag: v3.0-py3
Has anyone managed to run training via WSL?