Training with CUDA on WSL 2 hangs after several training steps

I have an issue with WSL 2 on the latest CUDA driver (470.25) with a single NVIDIA RTX 2060: when training deep learning models with the latest PyTorch version, I am always forced to run with synchronous computation, i.e. CUDA_LAUNCH_BLOCKING=1 python train.py --config ../config/default.yaml. I suspect some sort of deadlock occurs when running asynchronously (i.e. when omitting CUDA_LAUNCH_BLOCKING=1); this never happens when I train on a different platform (e.g. Colab, a second computer). The training hangs after roughly 0-100 training steps, the process becomes a zombie, and I can't kill it properly. nvidia-smi shows that memory usage is locked (CUDA memory is indeed still allocated) while GPU-Util gradually declines over about 10 minutes.
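
For completeness, the same flag can also be set from inside the script before torch is imported; a minimal sketch (the two prints at the end are just a sanity check):

import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call,
# so the safest place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print(torch.cuda.is_available())        # expect True
print(torch.cuda.get_device_name(0))    # "NVIDIA GeForce RTX 2060" on my machine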

Can anyone help me?

Hi @Cchivriga,
Can you please share the process and GPU usage stats as well? A reproducible script and model would also help us assist you better.
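
For the GPU side, the rolling utilization counters plus a periodic snapshot are usually the most useful, e.g.:

PS> nvidia-smi dmon -i 0     # per-second utilization, memory and clock counters for GPU 0
PS> nvidia-smi -l 5          # full status snapshot, refreshed every 5 seconds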

Thanks

1 Like

@AakankshaS

Hi, I’ve tried several things; the simplest example that hangs is:

import torch
import torch.nn as nn


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.sec1 = nn.Sequential(
            nn.Conv2d(1, 32, (41, 11), stride = (2, 1), padding = (20, 5)),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01)
        )

        self.sec2 = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size = (21, 11), stride = (2, 1), padding = (10, 5)),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01)
        )

        self.gru = nn.GRU(input_size = 1312, hidden_size = 512, num_layers = 3, batch_first = True, bidirectional = False)
        self.fc1 = nn.Linear(512, 26)

    def forward(self, x):
        batch_size = x.shape[0]

        x = self.sec1(x)
        x = self.sec2(x)

        # flatten (channels, height) into one feature dimension for the GRU: 32 * 41 = 1312
        x = x.transpose(1, 3)
        x = x.contiguous().view(batch_size, -1, x.size(2) * x.size(3))

        x, _ = self.gru(x)

        x = self.fc1(x)

        # (N, T, C) -> (T, N, C) so the output layout matches the target tensor
        x = x.transpose(0, 1).log_softmax(dim = 2)

        return x


device = "cuda"

x = torch.randn((12, 1, 161, 1200)).to(device)
y = torch.randn((1200, 12, 26)).to(device)

model = Model()
model = model.to(device)

criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr = 1e-4)
for i in range(200):
    optimizer.zero_grad()
    y_pred = model(x)

    loss = criterion(y_pred, y)
    print(f"Step {i} loss: {loss.item()}")

    loss.backward()
    optimizer.step()

Output from the script (until it hangs):

Step 0 loss: 11.632630348205566
Step 1 loss: 11.632563591003418
Step 2 loss: 11.632675170898438
Step 3 loss: 11.646272659301758
Step 4 loss: 11.646001815795898
Step 5 loss: 11.646049499511719
Step 6 loss: 11.645902633666992
Step 7 loss: 11.645750045776367
Step 8 loss: 11.645718574523926
Step 9 loss: 11.645608901977539
Step 10 loss: 11.645499229431152
Step 11 loss: 11.645389556884766
Step 12 loss: 11.645282745361328
Step 13 loss: 11.645173072814941
Step 14 loss: 11.644946098327637
Step 15 loss: 11.644959449768066
Step 16 loss: 11.644851684570312
Step 17 loss: 11.644745826721191
Step 18 loss: 11.644639015197754
Step 19 loss: 11.64453411102295
Step 20 loss: 11.644428253173828
Step 21 loss: 11.644325256347656
Step 22 loss: 11.644220352172852
Step 23 loss: 11.64393138885498
Step 24 loss: 11.644011497497559
Step 25 loss: 11.644028663635254
Step 26 loss: 11.643806457519531
Step 27 loss: 11.64370346069336
Step 28 loss: 11.643600463867188
Step 29 loss: 11.643498420715332
Step 30 loss: 11.643651962280273
Step 31 loss: 11.64329719543457
Step 32 loss: 11.643196105957031
Step 33 loss: 11.643089294433594

GPU logs (the middle section corresponds to the training process until it hangs):

PS> nvidia-smi dmon -i 0
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0     8    36     -    14    16     0     0   190    75
    0     8    36     -    17    17     0     0   203    80
    0     9    36     -    15    16     0     0   199    78
    0    26    37     -    16    14     0     0   907   222
    0    26    38     -     7     1     0     0  6800  1376
    0    26    38     -     8     1     0     0  6794  1379
    0    27    38     -     8     1     0     0  6794  1379
    0   145    46     -    79    25     0     0  6794  1624
    0   135    48     -    97    36     0     0  6794  1867
    0   154    50     -    97    39     0     0  6794  1842
    0   157    50     -    97    27     0     0  6794  1849
    0   138    52     -    97    37     0     0  6794  1849
    0   155    52     -    97    39     0     0  6794  1840
    0   133    53     -    97    26     0     0  6794  1846
    0   137    54     -    92    36     0     0  6794  1854
    0   156    55     -    98    35     0     0  6794  1830
    0   152    55     -    99    29     0     0  6794  1842
    0   132    55     -    98    39     0     0  6794  1832
    0   156    57     -    97    36     0     0  6794  1832
    0   152    57     -    98    28     0     0  6794  1837
    0   143    58     -    96    38     0     0  6794  1831
    0   153    59     -    96    33     0     0  6794  1822
    0   161    58     -    97    30     0     0  6794  1843
    0   135    60     -    98    39     0     0  6794  1833
    0   154    60     -    98    33     0     0  6794  1821
    0   159    60     -    98    31     0     0  6794  1831
    0   131    60     -    98    38     0     0  6794  1816
    0    44    56     -    93    30     0     0  6794  1813
    0    42    54     -     3     1     0     0  6794  1910
    0    27    52     -     2     1     0     0  6794  1847
    0    27    51     -     3     1     0     0  6794  1375
    0    27    50     -     3     1     0     0  6794  1375
    0    27    50     -     3     1     0     0  6794  1373
    0    27    49     -     3     1     0     0  6794  1373
    0    27    49     -     3     1     0     0  6794  1372
    0    27    48     -     3     1     0     0  6794  1374
    0    27    48     -     3     1     0     0  6794  1374
    0    27    47     -     3     1     0     0  6794  1374
    0    26    47     -     3     1     0     0  6794  1375
    0    26    47     -     3     1     0     0  6794  1375
    0    26    46     -     3     1     0     0  6794  1374
    0    26    46     -     3     1     0     0  6794  1374

Output from nvidia-smi (about 5 minutes after the hang; memory is still allocated, and repeating the command shows usage fluctuating by +/- 300 MiB):

PS> nvidia-smi
Tue Apr 27 22:10:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.14       Driver Version: 470.14       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:07:00.0  On |                  N/A |
| 40%   38C    P2    25W / 170W |   3688MiB /  6144MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

GPU devices nvidia-smi -L:
GPU 0: NVIDIA GeForce RTX 2060 (UUID: GPU-a6a48f53-2838-7b04-e98e-e0265c3310f8)

Running ps -aux:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
.....
domainf+  1039 99.6 35.3 30013532 2863236 pts/0 Rl+ 22:03  19:11 python sample_seq.py

Checking the task manager:

GPU 0
NVIDIA GeForce RTX 2060

Driver version:	27.21.14.7014
Driver date:	17-Mar-21
DirectX version:	12 (FL 12.1)
Physical location:	PCI bus 7, device 0, function 0

Utilization	3%
Dedicated GPU memory	3.7/6.0 GB
Shared GPU memory	0.1/8.0 GB
GPU Memory	3.8/14.0 GB

I am facing a similar problem while training a DNN with PyTorch on CUDA-enabled WSL2 Ubuntu 18.04 (using JupyterLab). Sometimes restarting the whole system helps. As a workaround, I save the parameters and resume training after a system restart.
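
Roughly what the save/resume workaround looks like in my case (a sketch; model, optimizer, step and the checkpoint path are placeholders for whatever your training script uses):

import torch

# save a checkpoint at regular intervals (or right before the point where training tends to hang)
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": step,
}, "checkpoint.pt")

# after the system restart: rebuild the model/optimizer, then restore and continue
ckpt = torch.load("checkpoint.pt", map_location="cuda")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"] + 1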

Any updates?

I forgot to mention that I've tested with multiple PyTorch versions (e.g. 1.8.*, 1.7.*, 1.4.*) and hit the same issue. I'm using WSL 2 with Windows Insider Program build 21370 (I downgraded as well to check - still the same issue), and I also tried underclocking the memory/core clocks with MSI Afterburner, without much success.

@AakankshaS any updates? Eventually I moved everything to Windows and have had absolutely no issues.

I have the same problem on the newest drivers.

I have the exact same issue across multiple CUDA/PyTorch versions (currently 1.9.0 / WSL 5.10.43 / driver 150.06 with 1x 3090); it is easily reproducible and frustratingly hard to solve. It always hangs on optimizer.step() for me.
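
A rough way to confirm that the hang really is in optimizer.step() rather than an earlier asynchronous kernel is to add explicit synchronization points in the loop; a sketch based on the repro script posted above (model, criterion, optimizer, x, y are the same objects as in that script):

for i in range(200):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)

    loss.backward()
    torch.cuda.synchronize()                # if it stalls here, a backward kernel is stuck
    print(f"step {i}: backward finished")

    optimizer.step()
    torch.cuda.synchronize()                # if it stalls here, the parameter update is stuck
    print(f"step {i}: step finished")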

Hi, I got the same problem with the newest 510.06 driver on Windows 11.
It hangs after anywhere from a few minutes to about an hour.
I tried to reinstall the driver and nothing changed.
Can anyone help?

Same problem here with a P1000, latest Windows 11 insider build 22458.1000 / WSL2 Ubuntu 20.04 LTS, and driver 510.06.

I am noticing a similar issue with Windows 11 and the latest insider build (22458). My driver is 30.0.15.1010 (aka 510.10?).
Given this build is “near final”, this is somewhat concerning.

Hi guys, I think I may have found a solution.

I got my model successfully trained in about 4 hours with my 3070 card, on:
Windows 11 dev channel 22463.1000 (previously I used 22458.1000)
WSL2 Ubuntu 20.04 LTS / 5.10.43 kernel
Mainstream Game Ready driver 472.12 instead of the dev driver for WSL
CUDA 11.4 with cuDNN 8.2.2

I don’t know whether the Windows update or the new NVIDIA driver solved the problem in my case, and I haven’t tested models that take longer to train. This may not be a final solution to this issue, but at least the training process no longer hangs within an hour. :)

Hello,
I’m running the current Windows 11 version 22000.194.
Using the test program from Cchivriga (see above), I switched my Windows driver back from 510.06 to 472.12 with no success (the program hangs after a few iterations).
Today (October 5) there was a kernel update from 5.10.43.3-microsoft-standard-WSL2 to 5.10.60.1-microsoft-standard-WSL2, and now the program succeeds (also using 472.12).

2 Likes

Hello all,

As pointed out by tjonker above and in this issue on GitHub: CUDA on WSL hangs after ~1h training · Issue #7443 · microsoft/WSL (github.com), there is a new kernel available that might address this hang. Make sure to run “wsl --update” and then “uname -a” to check your kernel (it should be 5.10.60.1).
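
For reference, the sequence from an elevated PowerShell prompt and then inside the distro (a wsl --shutdown in between makes sure the updated kernel is actually loaded; the uname -r output below is roughly what it should report):

PS> wsl --update            # pull the latest WSL2 kernel
PS> wsl --shutdown          # restart the WSL VM so the new kernel is used
$ uname -r                  # run inside the distro; should show the new kernel
5.10.60.1-microsoft-standard-WSL2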

Thanks,

1 Like