TLT with WSL2 possible? Segmentation Fault Error

Hello,

I'm using TLT within Windows 10 WSL2 (Ubuntu 20.04). I've successfully installed CUDA, Docker, and TLT 3.0 on WSL.
I was following the sample guide for Mask-RCNN, but I always get stuck on step 4 (Training), because I get the following message:

[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tlt-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
[23a3b5ac81fc:00077] *** Process received signal ***
[23a3b5ac81fc:00077] Signal: Segmentation fault (11)
[23a3b5ac81fc:00077] Signal code: Address not mapped (1)
[23a3b5ac81fc:00077] Failing at address: 0x10
[23a3b5ac81fc:00077] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7ff7ef71c040]
[23a3b5ac81fc:00077] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen16ThreadPoolDeviceENS_7functor7maximumIfEEE7ComputeEPNS_15OpKernelContextE+0x148)[0x7ff72cbce338]
[23a3b5ac81fc:00077] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf97ab2)[0x7ff726407ab2]
[23a3b5ac81fc:00077] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0xf853f5)[0x7ff7263f53f5]
[23a3b5ac81fc:00077] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x7282ed1)[0x7ff72e408ed1]
[23a3b5ac81fc:00077] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7ff7264b7791]
[23a3b5ac81fc:00077] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7ff7264b4df8]
[23a3b5ac81fc:00077] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7ff7eca816df]
[23a3b5ac81fc:00077] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7ff7ef4c56db]
[23a3b5ac81fc:00077] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ff7ef7fe71f]
[23a3b5ac81fc:00077] *** End of error message ***
Segmentation fault
2021-06-24 21:14:15,579 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The program hangs on "Saving checkpoints for 0…" for several minutes and then outputs the above message.
My ~/.tlt_mounts.json looks like this:

{
    "Mounts": [
        {
            "source": "/home/clk/tlt",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/home/user/tlt/data",
            "destination": "/workspace/tlt-experiments/data"
        },
        {
            "source": "/home/user/tlt/mask_rcnn/experiment_dir_unpruned",
            "destination": "/workspace/tlt-experiments/results"
        },
        {
            "source": "/home/user/tlt/mask_rcnn/specs",
            "destination": "/workspace/tlt-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
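
A mounts file like the one above can be sanity-checked before launching a training run. The following is only a minimal sketch, not part of TLT: the helper name check_mounts is made up for illustration, and it assumes the file uses the "Mounts" / "source" / "destination" keys shown in this post. It reports entries whose source directory does not exist on the host. Note that json.load will also reject the file outright if it contains typographic quotes instead of plain double quotes.

```python
import json
import os


def check_mounts(config, path_exists=os.path.isdir):
    """Return a list of problems found in a parsed ~/.tlt_mounts.json dict.

    `check_mounts` is a hypothetical helper, not a TLT function.
    """
    problems = []
    for i, mount in enumerate(config.get("Mounts", [])):
        source = mount.get("source")
        destination = mount.get("destination")
        if not source or not destination:
            problems.append(f"mount {i}: missing 'source' or 'destination'")
            continue
        if not path_exists(source):
            problems.append(f"mount {i}: source directory not found: {source}")
    return problems


if __name__ == "__main__":
    config_path = os.path.expanduser("~/.tlt_mounts.json")
    if os.path.isfile(config_path):
        with open(config_path, encoding="utf-8") as fh:
            config = json.load(fh)  # raises on invalid JSON, e.g. curly quotes
        for line in check_mounts(config) or ["no problems found"]:
            print(line)
    else:
        print(f"{config_path} not found")
```

This only verifies the host side of each mount; it says nothing about whether the segmentation fault is related to the mounts.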

Further info:
WSL: 5.4.72-microsoft-standard-WSL2
Distro: Ubuntu 20.04
TLT Version 3.0
docker_tag: v3.0-py3

Has anyone achieved training via WSL?

How about other networks? For example, can you try to train lprnet with its jupyter notebook?

I have tried using TLT in WSL2 with all the latest updates and drivers.
I explored the Jupyter notebook for classification.
In my experience, NCCL causes an issue with the training process, even though I am using a single GPU (RTX 3090).
I spent a week trying to get this to work, but in vain.
I now have a dual-boot system and use Ubuntu just for training.

Since GPU support in WSL2 is at a nascent stage, it will probably take some time for the TLT engineers to factor in WSL2 as an option and work on it.

Yes, it is possible to run TLT in WSL2. I just verified TLT 2.0 detectnet_v2 training on one GeForce RTX 2070, based on Ubuntu 18.04 and 5.10.43.3-microsoft-standard-WSL2.
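
To compare a local WSL2 kernel with the one mentioned in this reply, something like the sketch below could be used. It is only an illustration: the helper kernel_version is made up here, the 5.10.43.3 value is simply the kernel reported working in this thread and not an official minimum, and nothing in the thread confirms that the kernel version is what causes the crash.

```python
import platform
import re


def kernel_version(release):
    """Parse the leading dotted number of a kernel release string into a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.group().split("."))


# Kernel reported as working in this thread; an assumption, not a documented minimum.
REPORTED_WORKING = kernel_version("5.10.43.3-microsoft-standard-WSL2")

if __name__ == "__main__":
    release = platform.release()  # e.g. "5.4.72-microsoft-standard-WSL2"
    older = kernel_version(release) < REPORTED_WORKING
    relation = "older than" if older else "at least as new as"
    print(f"{release} is {relation} the kernel reported working here")
```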

For TLT in WSL2, first make sure the CUDA applications below work well after you follow the WSL user guide (CUDA on WSL :: CUDA Toolkit Documentation):
/usr/local/cuda/samples/4_Finance/BlackScholes
/usr/local/cuda/samples/1_Utilities/deviceQuery
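
The check on these two samples can be scripted roughly as follows. This is a sketch under assumptions: it presumes each sample has already been built with make and that the resulting binary sits inside the sample directory under the directory's own name (the usual layout for the CUDA samples, but not stated in this thread). The helper run_sample is hypothetical, and a zero exit code is treated as "works".

```python
import os
import subprocess

# Sample directories named in this thread.
SAMPLE_DIRS = [
    "/usr/local/cuda/samples/4_Finance/BlackScholes",
    "/usr/local/cuda/samples/1_Utilities/deviceQuery",
]


def run_sample(sample_dir, timeout=120):
    """Run the binary inside a CUDA sample directory; return (status, binary)."""
    # Assumption: `make` leaves a binary named after the directory inside it.
    name = os.path.basename(sample_dir.rstrip("/"))
    binary = os.path.join(sample_dir, name)
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        return "missing", binary
    try:
        result = subprocess.run(
            [binary], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "timeout", binary
    return ("ok" if result.returncode == 0 else "failed"), binary


if __name__ == "__main__":
    for sample_dir in SAMPLE_DIRS:
        status, binary = run_sample(sample_dir)
        print(f"{status:8} {binary}")
```

A "missing" result usually just means the sample has not been built yet; "failed" or "timeout" suggests the CUDA setup inside WSL2 needs attention before trying TLT.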

@rajiv-singh ,
Please create a new topic and elaborate on your issue.
