Best Nvidia Driver for PyTorch? I'm crashing :-(

Training my models keeps crashing my PC, and the minidump points at nvlddmkm.sys. My next thought is to experiment with other driver versions (does anyone have a link to where I can get older ones?)

A few more details in case there are any Nvidia experts out there. I’m using the latest versions:

Windows 10 32Gig
RTX 4080 32Gig
Python 3.11.8
torch 2.2.1+cu121
torchaudio 2.2.1+cu121
torchvision 0.17.1

Nvidia driver 551.61 (release date: 02/22/2024)
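
In case anyone wants to compare setups, the versions above can be gathered with a short script. This is only a minimal sketch using the standard library; the helper name `collect_versions` is my own invention, and it reports "not installed" for a missing package instead of failing.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torchaudio", "torchvision")):
    """Return a dict of OS, Python, and package versions for a bug report."""
    report = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            # Reads the installed distribution's metadata without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The driver version itself is not visible to this script; running `nvidia-smi --query-gpu=driver_version --format=csv,noheader` in a terminal should print it.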

Minidump info below:

PROCESS_NAME: System

STACK_TEXT:
ffffe0032671e028 fffff80463e03ad0 : 0000000000000119 0000000000000005 ffffbc84caeaa000 ffffbc84caf30620 : nt!KeBugCheckEx
ffffe0032671e030 fffff8047cae029d : ffffbc84cafa6000 ffffbc84caeaa000 000000000000ffff fffff804712a434c : watchdog!WdLogEvent5_WdCriticalError+0xe0
ffffe0032671e070 fffff8047caced0d : ffffbc84cae78000 ffffbc84caeaa000 ffffe0032671e2c0 fffff8047cacef8e : dxgmms2!VidSchiProcessIsrFaultedPacket+0x26d
ffffe0032671e0f0 fffff8047cabe231 : ffffbc840001e794 fffff80471d4d690 ffffbc84cae78000 00000000ffffffff : dxgmms2!VidSchDdiNotifyInterruptWorker+0x10a9d
ffffe0032671e150 fffff80464a4d914 : ffffbc84c423a030 ffffe0032671e2c0 ffffbc84c4350000 0000000000000000 : dxgmms2!VidSchDdiNotifyInterrupt+0xd1
ffffe0032671e1a0 fffff80471d6b96f : ffffbc84c423a030 ffffbc84c4350000 ffffe0030000000e ffffe00300000000 : dxgkrnl!DxgNotifyInterruptCB+0x94
ffffe0032671e1d0 ffffbc84c423a030 : ffffbc84c4350000 ffffe0030000000e ffffe00300000000 fffff80471d6b8f1 : nvlddmkm+0xb9b96f
ffffe0032671e1d8 ffffbc84c4350000 : ffffe0030000000e ffffe00300000000 fffff80471d6b8f1 ffffbc84c4350000 : 0xffffbc84c423a030
ffffe0032671e1e0 ffffe0030000000e : ffffe00300000000 fffff80471d6b8f1 ffffbc84c4350000 0000000000000000 : 0xffffbc84c4350000
ffffe0032671e1e8 ffffe00300000000 : fffff80471d6b8f1 ffffbc84c4350000 0000000000000000 0000000000000000 : 0xffffe0030000000e
ffffe0032671e1f0 fffff80471d6b8f1 : ffffbc84c4350000 0000000000000000 0000000000000000 0000000000000000 : 0xffffe00300000000
ffffe0032671e1f8 ffffbc84c4350000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nvlddmkm+0xb9b8f1
ffffe0032671e200 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 ffffbc84c423a030 : 0xffffbc84c4350000

SYMBOL_NAME: nvlddmkm+b9b96f

MODULE_NAME: nvlddmkm

IMAGE_NAME: nvlddmkm.sys

STACK_COMMAND: .cxr; .ecxr ; kb

BUCKET_ID_FUNC_OFFSET: b9b96f

FAILURE_BUCKET_ID: 0x119_5_DRIVER_FAULTED_SYSTEM_COMMAND_nvlddmkm!unknown_function

OS_VERSION: 10.0.19041.1

BUILDLAB_STR: vb_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

FAILURE_ID_HASH: {55a61c3c-91b1-e527-dcff-f2f0d7348227}
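
For anyone reading the dump: the FAILURE_BUCKET_ID packs the bug check code, its first parameter, a description, and the faulting module into one string. Below is a rough sketch that splits it apart. The regular expression is my own assumption and is only fitted to the string in this dump; other bucket IDs may be laid out differently. The only name in the lookup table is the one I am sure of, 0x119 = VIDEO_SCHEDULER_INTERNAL_ERROR.

```python
import re

# Only the code seen in this dump is listed here.
BUGCHECK_NAMES = {0x119: "VIDEO_SCHEDULER_INTERNAL_ERROR"}

# Assumed layout: <code>_<param1>_<description>_<module>!<function>
BUCKET_RE = re.compile(r"^(0x[0-9A-Fa-f]+)_([0-9A-Fa-f]+)_(.+)_([^_!]+)!(.+)$")


def parse_bucket_id(bucket):
    """Split a FAILURE_BUCKET_ID string into its parts."""
    match = BUCKET_RE.match(bucket)
    if match is None:
        raise ValueError(f"unrecognised bucket id: {bucket!r}")
    code = int(match.group(1), 16)
    return {
        "bugcheck": code,
        "bugcheck_name": BUGCHECK_NAMES.get(code, "unknown"),
        "param1": int(match.group(2), 16),
        "description": match.group(3),
        "module": match.group(4),
        "function": match.group(5),
    }


if __name__ == "__main__":
    info = parse_bucket_id(
        "0x119_5_DRIVER_FAULTED_SYSTEM_COMMAND_nvlddmkm!unknown_function"
    )
    for key, value in info.items():
        print(f"{key}: {value}")
```

The code and parameter match the first two arguments to nt!KeBugCheckEx at the top of the stack (0x119 and 5).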

Other Observations:

  • PC is stable everywhere else (running Chat with RTX, Omniverse, etc.)
  • Training starts fine. It only crashes about 1 hour in.
  • Changing the model parameters to reduce memory use hasn’t helped so far.
  • GPU temperature stays below 40 degrees.
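
To check the temperature and memory observations right up to the moment of the crash, a small logger could run in a second terminal while training. This is a sketch, not something I have verified fixes anything: the function names and the `gpu_log.csv` path are made up for illustration, and it assumes `nvidia-smi` is on the PATH and that the GPU in question is index 0.

```python
import csv
import os
import subprocess
import time

# Field names accepted by `nvidia-smi --query-gpu`.
FIELDS = ["timestamp", "temperature.gpu", "utilization.gpu", "memory.used", "power.draw"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_sample():
    """Query GPU 0 once and return the parsed reading."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits", "--id=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=5.0):
    """Append a reading every interval_s seconds until interrupted."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # new or empty file: write the header first
            writer.writeheader()
        while True:
            writer.writerow(read_sample())
            # Flush and fsync so the latest row has the best chance of
            # surviving a blue screen.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` before starting training and opening `gpu_log.csv` after the reboot should show the last readings taken before the bug check.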

from transformers import TrainingArguments

training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,       # write a checkpoint every 25 steps
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,          # no mixed precision
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,        # -1: run length is set by num_train_epochs
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)
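
Since save_steps=25 writes a checkpoint every 25 steps, a crashed run does not have to start from scratch. As a stopgap, something like the sketch below could pick up the newest checkpoint after a reboot. The helper `latest_checkpoint` is hypothetical (my own name), and it assumes the Trainer's usual `checkpoint-<step>` folder naming inside `output_dir`.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir="./results"):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Hypothetical usage, assuming `trainer` is the Trainer built from training_params:
# ckpt = latest_checkpoint()
# trainer.train(resume_from_checkpoint=str(ckpt) if ckpt else None)
```

This does not address the driver crash itself; it only limits how much work is lost each time.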

I found an online example that reproduces this problem. I then ran the same example on a Windows 11 PC with an RTX 2070 and got the same result, so it does not look like a fault in my particular hardware.

I’ve also cross-posted on the PyTorch forums to see if I can find some parameters to work around the crash:
Torch crashing nVidia driver with this simple example :-( Help - windows - PyTorch Forums