Training my models keeps crashing my PC with an nvlddmkm.sys bugcheck. My next thought is to experiment with other driver versions (anyone got a link to where I can download older releases?)
A few more details in case there's an NVIDIA expert out there. I'm on the latest driver:
Windows 10, 32 GB RAM
RTX 4080, 32 GB
Python 3.11.8
torch 2.2.1+cu121
torchaudio 2.2.1+cu121
torchvision 0.17.1
NVIDIA driver 551.61 (released 02/22/2024)
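For anyone comparing environments, here's a quick snippet that dumps the same version info from inside the training venv (all standard torch calls):

import sys
import torch

print("Python:", sys.version.split()[0])
print("torch:", torch.__version__)            # 2.2.1+cu121 here
print("CUDA runtime:", torch.version.cuda)    # CUDA version torch was built against
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))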
Minidump info below:
PROCESS_NAME: System
STACK_TEXT:
ffffe003`2671e028 fffff804`63e03ad0 : 00000000`00000119 00000000`00000005 ffffbc84`caeaa000 ffffbc84`caf30620 : nt!KeBugCheckEx
ffffe003`2671e030 fffff804`7cae029d : ffffbc84`cafa6000 ffffbc84`caeaa000 00000000`0000ffff fffff804`712a434c : watchdog!WdLogEvent5_WdCriticalError+0xe0
ffffe003`2671e070 fffff804`7caced0d : ffffbc84`cae78000 ffffbc84`caeaa000 ffffe003`2671e2c0 fffff804`7cacef8e : dxgmms2!VidSchiProcessIsrFaultedPacket+0x26d
ffffe003`2671e0f0 fffff804`7cabe231 : ffffbc84`0001e794 fffff804`71d4d690 ffffbc84`cae78000 00000000`ffffffff : dxgmms2!VidSchDdiNotifyInterruptWorker+0x10a9d
ffffe003`2671e150 fffff804`64a4d914 : ffffbc84`c423a030 ffffe003`2671e2c0 ffffbc84`c4350000 00000000`00000000 : dxgmms2!VidSchDdiNotifyInterrupt+0xd1
ffffe003`2671e1a0 fffff804`71d6b96f : ffffbc84`c423a030 ffffbc84`c4350000 ffffe003`0000000e ffffe003`00000000 : dxgkrnl!DxgNotifyInterruptCB+0x94
ffffe003`2671e1d0 ffffbc84`c423a030 : ffffbc84`c4350000 ffffe003`0000000e ffffe003`00000000 fffff804`71d6b8f1 : nvlddmkm+0xb9b96f
ffffe003`2671e1d8 ffffbc84`c4350000 : ffffe003`0000000e ffffe003`00000000 fffff804`71d6b8f1 ffffbc84`c4350000 : 0xffffbc84`c423a030
ffffe003`2671e1e0 ffffe003`0000000e : ffffe003`00000000 fffff804`71d6b8f1 ffffbc84`c4350000 00000000`00000000 : 0xffffbc84`c4350000
ffffe003`2671e1e8 ffffe003`00000000 : fffff804`71d6b8f1 ffffbc84`c4350000 00000000`00000000 00000000`00000000 : 0xffffe003`0000000e
ffffe003`2671e1f0 fffff804`71d6b8f1 : ffffbc84`c4350000 00000000`00000000 00000000`00000000 00000000`00000000 : 0xffffe003`00000000
ffffe003`2671e1f8 ffffbc84`c4350000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nvlddmkm+0xb9b8f1
ffffe003`2671e200 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 ffffbc84`c423a030 : 0xffffbc84`c4350000

SYMBOL_NAME: nvlddmkm+b9b96f
MODULE_NAME: nvlddmkm
IMAGE_NAME: nvlddmkm.sys
STACK_COMMAND: .cxr; .ecxr ; kb
BUCKET_ID_FUNC_OFFSET: b9b96f
FAILURE_BUCKET_ID: 0x119_5_DRIVER_FAULTED_SYSTEM_COMMAND_nvlddmkm!unknown_function
OS_VERSION: 10.0.19041.1
BUILDLAB_STR: vb_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {55a61c3c-91b1-e527-dcff-f2f0d7348227}
Other Observations:
- PC is stable everywhere else (running Chat with RTX, Omniverse, etc.)
- Training starts fine; it just crashes about an hour in
- Changing model parameters to reduce memory usage hasn't helped so far
- GPU temp stays below 40 °C
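Since a single temperature reading can miss a transient spike in the seconds before the bugcheck, here's a minimal side-logger sketch that could run in a second terminal; it assumes the pynvml bindings (pip install nvidia-ml-py), everything else is standard library:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 = the RTX 4080 here
try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # reported in mW
        print(f"{time.strftime('%H:%M:%S')}  {temp} C  "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB  {watts:.0f} W",
              flush=True)
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()

Redirecting plain nvidia-smi -l 10 to a file captures much the same data if installing the bindings is a hassle.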
from transformers import TrainingArguments

training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)
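For context, here's roughly the kind of loop these args get passed to; a sketch only, with placeholder model and dataset names rather than my exact script (paged_adamw_32bit also needs bitsandbytes installed):

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model_name = "some-base-model"  # placeholder, not the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

trainer = Trainer(
    model=model,
    args=training_params,
    train_dataset=train_ds,  # placeholder: a tokenized datasets.Dataset
    tokenizer=tokenizer,
)
trainer.train()  # the crash hits roughly an hour into this call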