I have an issue with WSL 2 using the latest CUDA driver (470.25) with a single NVIDIA RTX 2060: when training deep learning models with the latest PyTorch version, I am always forced to run synchronous computation, like so:
CUDA_LAUNCH_BLOCKING=1 python train.py --config ../config/default.yaml
I suspect some sort of deadlock occurs when running asynchronously (i.e. when CUDA_LAUNCH_BLOCKING=1 is omitted), which never happens on other platforms I train on (e.g. Colab, a second computer). The training hangs after roughly 0-100 training steps, the process ends up as a zombie, and I can't kill it properly. nvidia-smi shows that the memory stays locked (CUDA memory is indeed still allocated), while GPU-Util gradually declines over about 10 minutes.
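For completeness, a minimal sketch of how the same flag can be set from inside the script instead of the shell prefix (just an illustration; it assumes the variable is set before the first CUDA call, and the small tensor below is only there to force CUDA initialization):

import os

# CUDA_LAUNCH_BLOCKING is expected to be read when CUDA work starts,
# so set it before importing torch / touching the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

device = "cuda"
x = torch.randn(4, 4, device=device)  # first CUDA call now uses blocking launches
print(x.sum().item())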
Can anyone help me?
Hi @Cchivriga ,
Could you please share the process and GPU usage stats as well? A reproducible script and model would also help us assist you better.
Thanks
Hi, I've tried several things; here is the simplest example that hangs:
import torch
import torch.nn as nn

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sec1 = nn.Sequential(
            nn.Conv2d(1, 32, (41, 11), stride=(2, 1), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01)
        )
        self.sec2 = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01)
        )
        self.gru = nn.GRU(input_size=1312, hidden_size=512, num_layers=3, batch_first=True, bidirectional=False)
        self.fc1 = nn.Linear(512, 26)

    def forward(self, x):
        batch_size = x.shape[0]
        x = self.sec1(x)
        x = self.sec2(x)
        # (N, C, H, W) -> (N, W, H, C), then flatten the feature dims for the GRU
        x = x.transpose(1, 3)
        x = x.contiguous().view(batch_size, -1, x.size(2) * x.size(3))
        x, _ = self.gru(x)
        x = self.fc1(x)
        # (N, T, 26) -> (T, N, 26) to match the target layout
        x = x.transpose(0, 1).log_softmax(dim=2)
        return x

device = "cuda"
x = torch.randn((12, 1, 161, 1200)).to(device)
y = torch.randn((1200, 12, 26)).to(device)

model = Model()
model = model.to(device)
criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for i in range(200):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)
    print(f"Step {i} loss: {loss.item()}")
    loss.backward()
    optimizer.step()
Output from the script (until it hangs):
Step 0 loss: 11.632630348205566
Step 1 loss: 11.632563591003418
Step 2 loss: 11.632675170898438
Step 3 loss: 11.646272659301758
Step 4 loss: 11.646001815795898
Step 5 loss: 11.646049499511719
Step 6 loss: 11.645902633666992
Step 7 loss: 11.645750045776367
Step 8 loss: 11.645718574523926
Step 9 loss: 11.645608901977539
Step 10 loss: 11.645499229431152
Step 11 loss: 11.645389556884766
Step 12 loss: 11.645282745361328
Step 13 loss: 11.645173072814941
Step 14 loss: 11.644946098327637
Step 15 loss: 11.644959449768066
Step 16 loss: 11.644851684570312
Step 17 loss: 11.644745826721191
Step 18 loss: 11.644639015197754
Step 19 loss: 11.64453411102295
Step 20 loss: 11.644428253173828
Step 21 loss: 11.644325256347656
Step 22 loss: 11.644220352172852
Step 23 loss: 11.64393138885498
Step 24 loss: 11.644011497497559
Step 25 loss: 11.644028663635254
Step 26 loss: 11.643806457519531
Step 27 loss: 11.64370346069336
Step 28 loss: 11.643600463867188
Step 29 loss: 11.643498420715332
Step 30 loss: 11.643651962280273
Step 31 loss: 11.64329719543457
Step 32 loss: 11.643196105957031
Step 33 loss: 11.643089294433594
GPU logs (the middle part is the training process, up to the point where it hangs):
PS> nvidia-smi dmon -i 0
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 8 36 - 14 16 0 0 190 75
0 8 36 - 17 17 0 0 203 80
0 9 36 - 15 16 0 0 199 78
0 26 37 - 16 14 0 0 907 222
0 26 38 - 7 1 0 0 6800 1376
0 26 38 - 8 1 0 0 6794 1379
0 27 38 - 8 1 0 0 6794 1379
0 145 46 - 79 25 0 0 6794 1624
0 135 48 - 97 36 0 0 6794 1867
0 154 50 - 97 39 0 0 6794 1842
0 157 50 - 97 27 0 0 6794 1849
0 138 52 - 97 37 0 0 6794 1849
0 155 52 - 97 39 0 0 6794 1840
0 133 53 - 97 26 0 0 6794 1846
0 137 54 - 92 36 0 0 6794 1854
0 156 55 - 98 35 0 0 6794 1830
0 152 55 - 99 29 0 0 6794 1842
0 132 55 - 98 39 0 0 6794 1832
0 156 57 - 97 36 0 0 6794 1832
0 152 57 - 98 28 0 0 6794 1837
0 143 58 - 96 38 0 0 6794 1831
0 153 59 - 96 33 0 0 6794 1822
0 161 58 - 97 30 0 0 6794 1843
0 135 60 - 98 39 0 0 6794 1833
0 154 60 - 98 33 0 0 6794 1821
0 159 60 - 98 31 0 0 6794 1831
0 131 60 - 98 38 0 0 6794 1816
0 44 56 - 93 30 0 0 6794 1813
0 42 54 - 3 1 0 0 6794 1910
0 27 52 - 2 1 0 0 6794 1847
0 27 51 - 3 1 0 0 6794 1375
0 27 50 - 3 1 0 0 6794 1375
0 27 50 - 3 1 0 0 6794 1373
0 27 49 - 3 1 0 0 6794 1373
0 27 49 - 3 1 0 0 6794 1372
0 27 48 - 3 1 0 0 6794 1374
0 27 48 - 3 1 0 0 6794 1374
0 27 47 - 3 1 0 0 6794 1374
0 26 47 - 3 1 0 0 6794 1375
0 26 47 - 3 1 0 0 6794 1375
0 26 46 - 3 1 0 0 6794 1374
0 26 46 - 3 1 0 0 6794 1374
Output from nvidia-smi (after about 5 minutes of being hung; the memory is still allocated, and repeating the command shows the usage fluctuating by +/- 300 MiB):
PS> nvidia-smi
Tue Apr 27 22:10:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.14 Driver Version: 470.14 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:07:00.0 On | N/A |
| 40% 38C P2 25W / 170W | 3688MiB / 6144MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
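In case someone wants to log this over time instead of re-running the command by hand, a small polling sketch (assuming nvidia-smi is on the PATH; run it from a separate shell while the training process is hung):

import subprocess
import time

# Print GPU memory usage and utilization once per second.
while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,utilization.gpu", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(time.strftime("%H:%M:%S"), out.stdout.strip())
    time.sleep(1)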
GPU devices (nvidia-smi -L):
GPU 0: NVIDIA GeForce RTX 2060 (UUID: GPU-a6a48f53-2838-7b04-e98e-e0265c3310f8)
Running ps -aux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
.....
domainf+ 1039 99.6 35.3 30013532 2863236 pts/0 Rl+ 22:03 19:11 python sample_seq.py
Checking the task manager:
GPU 0
NVIDIA GeForce RTX 2060
Driver version: 27.21.14.7014
Driver date: 17-Mar-21
DirectX version: 12 (FL 12.1)
Physical location: PCI bus 7, device 0, function 0
Utilization 3%
Dedicated GPU memory 3.7/6.0 GB
Shared GPU memory 0.1/8.0 GB
GPU Memory 3.8/14.0 GB
I am facing a similar problem while training a DNN with PyTorch on CUDA-enabled WSL2 Ubuntu 18.04 (using JupyterLab). Sometimes restarting the whole system helps. As a workaround, I save the parameters and resume training after a system restart.
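For anyone who wants to do the same, a minimal save/resume sketch (the file name and the model/optimizer arguments are placeholders for your own objects):

import torch

# Save a checkpoint every few steps so a hang only costs the most recent steps.
def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

# After a system restart, rebuild the model/optimizer and restore their state.
def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cuda")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]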
Any updates?
I forgot to mention that I've tested multiple PyTorch versions (e.g. 1.8.*, 1.7.*, 1.4.*) with the same result, on WSL 2 with Windows Insider Program build 21370 (I downgraded as well to check, and still hit the same issue). I also tried underclocking the memory/core clock with MSI Afterburner, without much success.
@AakankshaS any updates? I eventually moved everything over to Windows and have had absolutely no issues there.
I have the same problem on the newest drivers.
I have the exact same issue across multiple CUDA/PyTorch versions (currently 1.9.0 / WSL kernel 5.10.43 / driver 150.06 with 1x 3090). It is easily reproducible and frustratingly hard to solve. It always hangs on optimizer.step() for me.
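In case it helps anyone narrow this down, here is a sketch of how one could check where the hang actually surfaces; torch.cuda.synchronize() forces the queued GPU work to finish, similar in spirit to CUDA_LAUNCH_BLOCKING. The function arguments are placeholders for your own training objects:

import torch

def step_with_sync(model, criterion, optimizer, x, y, i):
    # Synchronize after each stage so the hang can be attributed to a specific call
    # rather than to whichever later call happens to wait on the queued kernels.
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    torch.cuda.synchronize()  # hangs here -> the stuck kernel was queued by forward/backward
    optimizer.step()
    torch.cuda.synchronize()  # hangs here -> the stuck kernel was queued by the optimizer step
    print(f"step {i} done, loss {loss.item():.4f}")
    return loss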
Hi, I have the same problem with the newest 510.06 driver on Windows 11.
It hangs after anywhere from a few minutes to about an hour.
I tried reinstalling the driver and nothing changed.
Can anyone help?
Same problem here with a P1000, the latest Windows 11 Insider build 22458.1000 / WSL2 Ubuntu 20.04 LTS, and driver 510.06.
I am noticing a similar issue with Windows 11 and the latest Insider build (22458). My driver is 30.0.15.1010 (a.k.a. 510.10?).
Given that this build is "near final", this is somewhat concerning.
Hi guys, I think I may have found a solution.
I got my model trained successfully in about 4 hours with my 3070 card:
Windows 11 Dev channel 22463.1000 (previously I used 22458.1000)
WSL2 Ubuntu 20.04 LTS / 5.10.43 kernel
Mainstream Game Ready driver 472.12 instead of the dev driver for WSL
CUDA 11.4 with cuDNN 8.2.2
I don't know whether the Windows update or the new NVIDIA driver solved the problem in my case, and I haven't tested models that take longer to train. This may not be a final fix for the issue, but at least the training process no longer hangs within an hour. :)
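If it helps anyone compare setups, here is a quick way to print the versions that PyTorch actually sees inside WSL (standard torch attributes, nothing specific to my machine):

import torch

# Report the versions used by the PyTorch build running inside WSL2.
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
print("Visible GPU count:", torch.cuda.device_count())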
Hello,
I'm running the current Windows 11 version 22000.194.
Using the test program from Cchivriga (see above), I switched my Windows driver back from 510.06 to 472.12 with no success (the program still hangs after a few iterations).
Today (October 5) there was a kernel update from 5.10.43.3-microsoft-standard-WSL2 to 5.10.60.1-microsoft-standard-WSL2, and now the program succeeds (also using 472.12).
Hello all,
As pointed out by tjonker above and in that issue on GitHub (CUDA on WSL hangs after ~1h training · Issue #7443 · microsoft/WSL (github.com)), there is a new kernel available that might address this hang. Make sure to run "wsl --update" and then "uname -a" to check your kernel (it should be 5.10.60.1).
Thanks,