Training my models keeps crashing my PC with an nvlddmkm.sys bugcheck. My next thought is to experiment with other driver versions (anyone got a link to where I can download older releases?)
A few more details in case there's an NVIDIA expert out there. I'm on the latest driver:
Windows 10, 32 GB RAM
RTX 4080, 32 GB
Python 3.11.8
torch 2.2.1+cu121
torchaudio 2.2.1+cu121
torchvision 0.17.1
NVIDIA driver 551.61 (released 02/22/2024)
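For anyone comparing environments, here's a quick snippet that dumps the same version info from inside the training venv (all standard torch calls):

import sys
import torch

print("Python:", sys.version.split()[0])
print("torch:", torch.__version__)            # 2.2.1+cu121 here
print("CUDA runtime:", torch.version.cuda)    # CUDA version torch was built against
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))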
Minidump info below:
PROCESS_NAME: System
STACK_TEXT:
ffffe003`2671e028 fffff804`63e03ad0 : 00000000`00000119 00000000`00000005 ffffbc84`caeaa000 ffffbc84`caf30620 : nt!KeBugCheckEx
ffffe003`2671e030 fffff804`7cae029d : ffffbc84`cafa6000 ffffbc84`caeaa000 00000000`0000ffff fffff804`712a434c : watchdog!WdLogEvent5_WdCriticalError+0xe0
ffffe003`2671e070 fffff804`7caced0d : ffffbc84`cae78000 ffffbc84`caeaa000 ffffe003`2671e2c0 fffff804`7cacef8e : dxgmms2!VidSchiProcessIsrFaultedPacket+0x26d
ffffe003`2671e0f0 fffff804`7cabe231 : ffffbc84`0001e794 fffff804`71d4d690 ffffbc84`cae78000 00000000`ffffffff : dxgmms2!VidSchDdiNotifyInterruptWorker+0x10a9d
ffffe003`2671e150 fffff804`64a4d914 : ffffbc84`c423a030 ffffe003`2671e2c0 ffffbc84`c4350000 00000000`00000000 : dxgmms2!VidSchDdiNotifyInterrupt+0xd1
ffffe003`2671e1a0 fffff804`71d6b96f : ffffbc84`c423a030 ffffbc84`c4350000 ffffe003`0000000e ffffe003`00000000 : dxgkrnl!DxgNotifyInterruptCB+0x94
ffffe003`2671e1d0 ffffbc84`c423a030 : ffffbc84`c4350000 ffffe003`0000000e ffffe003`00000000 fffff804`71d6b8f1 : nvlddmkm+0xb9b96f
ffffe003`2671e1d8 ffffbc84`c4350000 : ffffe003`0000000e ffffe003`00000000 fffff804`71d6b8f1 ffffbc84`c4350000 : 0xffffbc84`c423a030
ffffe003`2671e1e0 ffffe003`0000000e : ffffe003`00000000 fffff804`71d6b8f1 ffffbc84`c4350000 00000000`00000000 : 0xffffbc84`c4350000
ffffe003`2671e1e8 ffffe003`00000000 : fffff804`71d6b8f1 ffffbc84`c4350000 00000000`00000000 00000000`00000000 : 0xffffe003`0000000e
ffffe003`2671e1f0 fffff804`71d6b8f1 : ffffbc84`c4350000 00000000`00000000 00000000`00000000 00000000`00000000 : 0xffffe003`00000000
ffffe003`2671e1f8 ffffbc84`c4350000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nvlddmkm+0xb9b8f1
ffffe003`2671e200 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 ffffbc84`c423a030 : 0xffffbc84`c4350000

SYMBOL_NAME: nvlddmkm+b9b96f
MODULE_NAME: nvlddmkm
IMAGE_NAME: nvlddmkm.sys
STACK_COMMAND: .cxr; .ecxr ; kb
BUCKET_ID_FUNC_OFFSET: b9b96f
FAILURE_BUCKET_ID: 0x119_5_DRIVER_FAULTED_SYSTEM_COMMAND_nvlddmkm!unknown_function
OS_VERSION: 10.0.19041.1
BUILDLAB_STR: vb_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {55a61c3c-91b1-e527-dcff-f2f0d7348227}
Other Observations:
- PC is stable everywhere else (running Chat with RTX, Omniverse, etc.)
- Training starts fine; it just crashes about an hour in
- Changing model parameters to reduce memory usage hasn't helped so far
- GPU temp stays below 40 °C
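Since a single temperature reading can miss a transient spike in the seconds before the bugcheck, here's a minimal side-logger sketch that could run in a second terminal; it assumes the pynvml bindings (pip install nvidia-ml-py), everything else is standard library:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 = the RTX 4080 here
try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # reported in mW
        print(f"{time.strftime('%H:%M:%S')}  {temp} C  "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB  {watts:.0f} W",
              flush=True)
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()

Redirecting plain nvidia-smi -l 10 to a file captures much the same data if installing the bindings is a hassle.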
from transformers import TrainingArguments

training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)
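For context, here's roughly the kind of loop these args get passed to; a sketch only, with placeholder model and dataset names rather than my exact script (paged_adamw_32bit also needs bitsandbytes installed):

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model_name = "some-base-model"  # placeholder, not the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

trainer = Trainer(
    model=model,
    args=training_params,
    train_dataset=train_ds,  # placeholder: a tokenized datasets.Dataset
    tokenizer=tokenizer,
)
trainer.train()  # the crash hits roughly an hour into this call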