Training with CUDA on WSL 2 hangs after several training steps

I have an issue with WSL 2 on the latest CUDA driver (470.25) with a single NVIDIA RTX 2060: when training deep learning models with the latest PyTorch version, I am always forced to run with synchronous computation, i.e. CUDA_LAUNCH_BLOCKING=1 python train.py --config ../config/default.yaml. I suspect some sort of deadlock occurs when running asynchronously (i.e. when omitting CUDA_LAUNCH_BLOCKING=1); this never happens when I train on a different platform (e.g. Colab, a second computer). The training hangs after roughly 0-100 training steps, the process becomes a zombie, and I can't kill it properly. nvidia-smi shows that memory usage is locked (CUDA memory is indeed still allocated) while GPU-Util gradually declines over about 10 minutes.
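
For completeness, the same flag can also be set from inside the script before torch is imported; a minimal sketch (the two prints at the end are just a sanity check):

import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call,
# so the safest place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print(torch.cuda.is_available())        # expect True
print(torch.cuda.get_device_name(0))    # "NVIDIA GeForce RTX 2060" on my machine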

Can anyone help me?

Hi @Cchivriga,
Can you please share the process and GPU usage stats as well? A reproducible script and model would also help us assist you better.
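
For the GPU side, the rolling utilization counters plus a periodic snapshot are usually the most useful, e.g.:

PS> nvidia-smi dmon -i 0     # per-second utilization, memory and clock counters for GPU 0
PS> nvidia-smi -l 5          # full status snapshot, refreshed every 5 seconds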

Thanks

1 Like

@AakankshaS

Hi, I’ve tried several things; the simplest example that hangs is:

import torch
import torch.nn as nn


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.sec1 = nn.Sequential(
            nn.Conv2d(1, 32, (41, 11), stride = (2, 1), padding = (20, 5)),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01)
        )

        self.sec2 = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size = (21, 11), stride = (2, 1), padding = (10, 5)),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01)
        )

        self.gru = nn.GRU(input_size = 1312, hidden_size = 512, num_layers = 3, batch_first = True, bidirectional = False)
        self.fc1 = nn.Linear(512, 26)

    def forward(self, x):
        batch_size = x.shape[0]

        x = self.sec1(x)
        x = self.sec2(x)

        # flatten (channels, height) into one feature dimension for the GRU: 32 * 41 = 1312
        x = x.transpose(1, 3)
        x = x.contiguous().view(batch_size, -1, x.size(2) * x.size(3))

        x, _ = self.gru(x)

        x = self.fc1(x)

        # (N, T, C) -> (T, N, C) so the output layout matches the target tensor
        x = x.transpose(0, 1).log_softmax(dim = 2)

        return x


device = "cuda"

x = torch.randn((12, 1, 161, 1200)).to(device)
y = torch.randn((1200, 12, 26)).to(device)

model = Model()
model = model.to(device)

criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr = 1e-4)
for i in range(200):
    optimizer.zero_grad()
    y_pred = model(x)

    loss = criterion(y_pred, y)
    print(f"Step {i} loss: {loss.item()}")

    loss.backward()
    optimizer.step()

Output from the script (until it hangs):

Step 0 loss: 11.632630348205566
Step 1 loss: 11.632563591003418
Step 2 loss: 11.632675170898438
Step 3 loss: 11.646272659301758
Step 4 loss: 11.646001815795898
Step 5 loss: 11.646049499511719
Step 6 loss: 11.645902633666992
Step 7 loss: 11.645750045776367
Step 8 loss: 11.645718574523926
Step 9 loss: 11.645608901977539
Step 10 loss: 11.645499229431152
Step 11 loss: 11.645389556884766
Step 12 loss: 11.645282745361328
Step 13 loss: 11.645173072814941
Step 14 loss: 11.644946098327637
Step 15 loss: 11.644959449768066
Step 16 loss: 11.644851684570312
Step 17 loss: 11.644745826721191
Step 18 loss: 11.644639015197754
Step 19 loss: 11.64453411102295
Step 20 loss: 11.644428253173828
Step 21 loss: 11.644325256347656
Step 22 loss: 11.644220352172852
Step 23 loss: 11.64393138885498
Step 24 loss: 11.644011497497559
Step 25 loss: 11.644028663635254
Step 26 loss: 11.643806457519531
Step 27 loss: 11.64370346069336
Step 28 loss: 11.643600463867188
Step 29 loss: 11.643498420715332
Step 30 loss: 11.643651962280273
Step 31 loss: 11.64329719543457
Step 32 loss: 11.643196105957031
Step 33 loss: 11.643089294433594

GPU logs (the middle section corresponds to the training process until it hangs):

PS> nvidia-smi dmon -i 0
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0     8    36     -    14    16     0     0   190    75
    0     8    36     -    17    17     0     0   203    80
    0     9    36     -    15    16     0     0   199    78
    0    26    37     -    16    14     0     0   907   222
    0    26    38     -     7     1     0     0  6800  1376
    0    26    38     -     8     1     0     0  6794  1379
    0    27    38     -     8     1     0     0  6794  1379
    0   145    46     -    79    25     0     0  6794  1624
    0   135    48     -    97    36     0     0  6794  1867
    0   154    50     -    97    39     0     0  6794  1842
    0   157    50     -    97    27     0     0  6794  1849
    0   138    52     -    97    37     0     0  6794  1849
    0   155    52     -    97    39     0     0  6794  1840
    0   133    53     -    97    26     0     0  6794  1846
    0   137    54     -    92    36     0     0  6794  1854
    0   156    55     -    98    35     0     0  6794  1830
    0   152    55     -    99    29     0     0  6794  1842
    0   132    55     -    98    39     0     0  6794  1832
    0   156    57     -    97    36     0     0  6794  1832
    0   152    57     -    98    28     0     0  6794  1837
    0   143    58     -    96    38     0     0  6794  1831
    0   153    59     -    96    33     0     0  6794  1822
    0   161    58     -    97    30     0     0  6794  1843
    0   135    60     -    98    39     0     0  6794  1833
    0   154    60     -    98    33     0     0  6794  1821
    0   159    60     -    98    31     0     0  6794  1831
    0   131    60     -    98    38     0     0  6794  1816
    0    44    56     -    93    30     0     0  6794  1813
    0    42    54     -     3     1     0     0  6794  1910
    0    27    52     -     2     1     0     0  6794  1847
    0    27    51     -     3     1     0     0  6794  1375
    0    27    50     -     3     1     0     0  6794  1375
    0    27    50     -     3     1     0     0  6794  1373
    0    27    49     -     3     1     0     0  6794  1373
    0    27    49     -     3     1     0     0  6794  1372
    0    27    48     -     3     1     0     0  6794  1374
    0    27    48     -     3     1     0     0  6794  1374
    0    27    47     -     3     1     0     0  6794  1374
    0    26    47     -     3     1     0     0  6794  1375
    0    26    47     -     3     1     0     0  6794  1375
    0    26    46     -     3     1     0     0  6794  1374
    0    26    46     -     3     1     0     0  6794  1374

Output from nvidia-smi (about 5 minutes after the hang; memory is still allocated, and repeating the command shows usage fluctuating by +/- 300 MiB):

PS> nvidia-smi
Tue Apr 27 22:10:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.14       Driver Version: 470.14       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:07:00.0  On |                  N/A |
| 40%   38C    P2    25W / 170W |   3688MiB /  6144MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

GPU devices nvidia-smi -L:
GPU 0: NVIDIA GeForce RTX 2060 (UUID: GPU-a6a48f53-2838-7b04-e98e-e0265c3310f8)

Running ps -aux:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
.....
domainf+  1039 99.6 35.3 30013532 2863236 pts/0 Rl+ 22:03  19:11 python sample_seq.py

Checking the task manager:

GPU 0
NVIDIA GeForce RTX 2060

Driver version:	27.21.14.7014
Driver date:	17-Mar-21
DirectX version:	12 (FL 12.1)
Physical location:	PCI bus 7, device 0, function 0

Utilization	3%
Dedicated GPU memory	3.7/6.0 GB
Shared GPU memory	0.1/8.0 GB
GPU Memory	3.8/14.0 GB

I am facing a similar problem while training a DNN with PyTorch on CUDA-enabled WSL2 Ubuntu 18.04 (using JupyterLab). Sometimes restarting the whole system helps. As a workaround, I save the parameters and resume training after a system restart.
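
Roughly what the save/resume workaround looks like in my case (a sketch; model, optimizer, step and the checkpoint path are placeholders for whatever your training script uses):

import torch

# save a checkpoint at regular intervals (or right before the point where training tends to hang)
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": step,
}, "checkpoint.pt")

# after the system restart: rebuild the model/optimizer, then restore and continue
ckpt = torch.load("checkpoint.pt", map_location="cuda")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"] + 1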

Any updates?

I forgot to mention that I've tested with multiple PyTorch versions (e.g. 1.8.*, 1.7.*, 1.4.*) and hit the same issue. I'm using WSL 2 with Windows Insider Program build 21370 (I downgraded as well to check - still the same issue), and I also tried underclocking the memory/core clocks with MSI Afterburner, without much success.

@AakankshaS any updates? Eventually I moved everything to Windows and have had absolutely no issues.

I have the same problem on the newest drivers.

I have the exact same issue across multiple CUDA/PyTorch versions (currently 1.9.0 / WSL 5.10.43 / driver 150.06 with 1x 3090); it is easily reproducible and frustratingly hard to solve. It always hangs on optimizer.step() for me.
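
A rough way to confirm that the hang really is in optimizer.step() rather than an earlier asynchronous kernel is to add explicit synchronization points in the loop; a sketch based on the repro script posted above (model, criterion, optimizer, x, y are the same objects as in that script):

for i in range(200):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)

    loss.backward()
    torch.cuda.synchronize()                # if it stalls here, a backward kernel is stuck
    print(f"step {i}: backward finished")

    optimizer.step()
    torch.cuda.synchronize()                # if it stalls here, the parameter update is stuck
    print(f"step {i}: step finished")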

Hi, I got the same problem with the newest 510.06 driver on Windows 11.
It hangs after anywhere from a few minutes to about an hour.
I tried to reinstall the driver and nothing changed.
Can anyone help?

Same problem here with a P1000, latest Windows 11 insider build 22458.1000 / WSL2 Ubuntu 20.04 LTS, and driver 510.06.

I am noticing a similar issue with Windows 11 and the latest insider build (22458). My driver is 30.0.15.1010 (aka 510.10?).
Given this build is “near final”, this is somewhat concerning.

Hi guys, I think I may have found a solution.

I got my model successfully trained in about 4 hours with my 3070 card, on:
Windows 11 dev channel 22463.1000 (previously I used 22458.1000)
WSL2 Ubuntu 20.04 LTS / 5.10.43 kernel
Mainstream Game Ready driver 472.12 instead of the dev driver for WSL
CUDA 11.4 with cuDNN 8.2.2

I don’t know whether the Windows update or the new NVIDIA driver solved the problem in my case, and I haven’t tested models that take longer to train. This may not be a final solution to this issue, but at least the training process no longer hangs within an hour. :)

Hello,
I’m running the current Windows 11 version 22000.194.
Using the test program from Cchivriga (see above), I switched my Windows driver back from 510.06 to 472.12 with no success (the program hangs after a few iterations).
Today (October 5) there was a kernel update from 5.10.43.3-microsoft-standard-WSL2 to 5.10.60.1-microsoft-standard-WSL2, and now the program succeeds (also using 472.12).

2 Likes

Hello all,

As pointed out by tjonker above and in this issue on GitHub: CUDA on WSL hangs after ~1h training · Issue #7443 · microsoft/WSL (github.com), there is a new kernel available that might address this hang. Make sure to run “wsl --update” and then “uname -a” to check your kernel (it should be 5.10.60.1).
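
For reference, the sequence from an elevated PowerShell prompt and then inside the distro (a wsl --shutdown in between makes sure the updated kernel is actually loaded; the uname -r output below is roughly what it should report):

PS> wsl --update            # pull the latest WSL2 kernel
PS> wsl --shutdown          # restart the WSL VM so the new kernel is used
$ uname -r                  # run inside the distro; should show the new kernel
5.10.60.1-microsoft-standard-WSL2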

Thanks,

1 Like