TLT with WSL2 possible? Segmentation Fault Error

VisionSystem · June 24, 2021, 7:27pm

Hello,

I'm using TLT within Windows 10 WSL2 (Ubuntu 20.04). I've successfully installed CUDA, Docker, and TLT 3.0 on WSL.
I was following the sample guide for Mask-RCNN, but I always get stuck on step 4 (Training), where I get the following message:

[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tlt-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
[23a3b5ac81fc:00077] *** Process received signal ***
[23a3b5ac81fc:00077] Signal: Segmentation fault (11)
[23a3b5ac81fc:00077] Signal code: Address not mapped (1)
[23a3b5ac81fc:00077] Failing at address: 0x10
[23a3b5ac81fc:00077] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7ff7ef71c040]
[23a3b5ac81fc:00077] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen16ThreadPoolDeviceENS_7functor7maximumIfEEE7ComputeEPNS_15OpKernelContextE+0x148)[0x7ff72cbce338]
[23a3b5ac81fc:00077] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf97ab2)[0x7ff726407ab2]
[23a3b5ac81fc:00077] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf853f5)[0x7ff7263f53f5]
[23a3b5ac81fc:00077] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x7282ed1)[0x7ff72e408ed1]
[23a3b5ac81fc:00077] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7ff7264b7791]
[23a3b5ac81fc:00077] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7ff7264b4df8]
[23a3b5ac81fc:00077] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7ff7eca816df]
[23a3b5ac81fc:00077] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7ff7ef4c56db]
[23a3b5ac81fc:00077] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ff7ef7fe71f]
[23a3b5ac81fc:00077] *** End of error message ***
Segmentation fault
2021-06-24 21:14:15,579 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
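
A backtrace like the one above is easier to read once the frames are split into module, symbol, and offset. The following is only a rough sketch for doing that on a saved copy of the log; the function names `parse_frames` and `first_named_frame` are made up for illustration, and the regular expression assumes the exact frame layout shown in this post.

```python
import re

# Matches frame lines such as:
#   [host:pid] [ 1] /path/to/lib.so(symbol+0x148)[0x7ff72cbce338]
# The symbol part may be empty, as in "(+0x3f040)".
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"
    r"(?P<module>[^\s(]+)"
    r"\((?P<symbol>[^)+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "index": int(match.group("index")),
                "module": match.group("module"),
                "symbol": match.group("symbol"),
                "offset": match.group("offset"),
            })
    return frames


def first_named_frame(frames):
    """Return the first frame that carries a symbol name, or None."""
    for frame in frames:
        if frame["symbol"]:
            return frame
    return None
```

Applied to this log, the first named frame is frame 1, whose mangled symbol contains `ThreadPoolDevice` and `maximum`. That suggests the fault happened in a CPU-side TensorFlow kernel running on an Eigen worker thread, not inside a CUDA call, although this alone does not explain why it crashed. Piping the symbol through `c++filt` gives the readable C++ name.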

The program hangs on “Saving checkpoints for 0…” for several minutes and then outputs the message above.
My ~/.tlt_mounts.json looks like this:

{
    "Mounts": [
        {
            "source": "/home/clk/tlt",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/home/user/tlt/data",
            "destination": "/workspace/tlt-experiments/data"
        },
        {
            "source": "/home/user/tlt/mask_rcnn/experiment_dir_unpruned",
            "destination": "/workspace/tlt-experiments/results"
        },
        {
            "source": "/home/user/tlt/mask_rcnn/specs",
            "destination": "/workspace/tlt-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
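
Since the mounts decide where the checkpoint from the log actually lands on the host, it can help to sanity-check the file before training. Below is a minimal sketch, not part of the TLT launcher: `check_mounts`, `resolve_host_path`, and `load_mounts` are hypothetical helper names, and the "deepest matching destination wins" rule is my assumption about how nested bind mounts behave. Note that `json.load` only accepts straight double quotes, so typographic quotes in the file would be rejected.

```python
import json
import os


def load_mounts(path="~/.tlt_mounts.json"):
    """Parse the mounts file; raises json.JSONDecodeError if it is not valid JSON."""
    with open(os.path.expanduser(path), encoding="utf-8") as handle:
        return json.load(handle)


def check_mounts(config, path_exists=os.path.exists):
    """Return a list of warnings about the "Mounts" section of a parsed config."""
    warnings = []
    mounts = config.get("Mounts", [])
    for mount in mounts:
        if not path_exists(mount["source"]):
            warnings.append("source does not exist: " + mount["source"])
    # Collect the /home/<name> prefix of every source to spot mixed-up users.
    homes = {
        "/".join(mount["source"].split("/")[:3])
        for mount in mounts
        if mount["source"].startswith("/home/")
    }
    if len(homes) > 1:
        warnings.append(
            "sources span several home directories: " + ", ".join(sorted(homes))
        )
    return warnings


def resolve_host_path(config, container_path):
    """Map a container path to a host path via the deepest matching mount.

    Returns None when no mount destination covers the path.
    """
    best = None
    for mount in config.get("Mounts", []):
        dest = mount["destination"].rstrip("/")
        if container_path == dest or container_path.startswith(dest + "/"):
            if best is None or len(dest) > len(best["destination"].rstrip("/")):
                best = mount
    if best is None:
        return None
    dest = best["destination"].rstrip("/")
    return best["source"].rstrip("/") + container_path[len(dest):]
```

With the mounts posted above, `resolve_host_path` maps the checkpoint path from the log to a location under `/home/clk/tlt`, not under the `experiment_dir_unpruned` directory that is mounted at `/workspace/tlt-experiments/results`, and `check_mounts` flags that the sources sit under both `/home/clk` and `/home/user`. Whether either point is related to the crash is not established here.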

Further info:
WSL: 5.4.72-microsoft-standard-WSL2
Distro: Ubuntu 20.04
TLT Version 3.0
docker_tag: v3.0-py3

Has anyone achieved training via WSL?

Morganh · June 24, 2021, 11:34pm

How about other networks? For example, can you try to train lprnet with its jupyter notebook?

rajiv-singh · June 25, 2021, 3:20am

I have tried using TLT in WSL2 with all the latest updates and drivers.
I explored the Jupyter notebook for classification.
In my experience, NCCL causes an issue with the training process, even though I am using a single GPU (RTX 3090).
I spent a week trying to get this to work, but in vain.
I now have a dual-boot system so that I can use Ubuntu just for training.

Since GPU support in WSL2 is still at a nascent stage, it will probably take some time for the TLT engineers to factor in WSL2 as an option and work on it.

Morganh · June 27, 2021, 6:28am

Yes, it is possible to run TLT in WSL2. I just verified TLT 2.0 detectnet_v2 training on one GeForce RTX 2070 with Ubuntu 18.04 and 5.10.43.3-microsoft-standard-WSL2.

For TLT in WSL2, first please make sure the CUDA sample applications below work when you follow the WSL user guide (CUDA on WSL :: CUDA Toolkit Documentation):
/usr/local/cuda/samples/4_Finance/BlackScholes
/usr/local/cuda/samples/1_Utilities/deviceQuery
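
To check both samples in one go, something like the sketch below could be used. It assumes each sample has already been built in place with `make` and that the binary carries the same name as its directory; `run_sample` and `check_all` are hypothetical helper names, not part of CUDA or TLT.

```python
import os
import subprocess

# Sample directories named in this post.
SAMPLES = [
    "/usr/local/cuda/samples/4_Finance/BlackScholes",
    "/usr/local/cuda/samples/1_Utilities/deviceQuery",
]


def run_sample(sample_dir, timeout=300):
    """Run the sample binary in sample_dir and return a short status string."""
    # Assumption: "make" left a binary named after the directory inside it.
    binary = os.path.join(sample_dir, os.path.basename(sample_dir.rstrip("/")))
    if not os.path.isfile(binary):
        return "missing (not built yet?)"
    try:
        result = subprocess.run(
            [binary],
            cwd=sample_dir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    except OSError as error:
        return "could not start: " + str(error)
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX a negative return code means the process died from a signal,
        # for example -11 for a segmentation fault.
        return "killed by signal " + str(-result.returncode)
    return "failed (exit code " + str(result.returncode) + ")"


def check_all(sample_dirs):
    """Return a dict mapping each sample directory to its status string."""
    return {sample_dir: run_sample(sample_dir) for sample_dir in sample_dirs}
```

Calling `check_all(SAMPLES)` inside the WSL2 distro returns one status per sample. If either one is not "ok", the CUDA setup in WSL2 likely needs fixing before TLT training is worth retrying.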

@rajiv-singh, please create a new topic and elaborate on your issue.

system · August 26, 2021, 6:28am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
