Hi,
We have been experiencing intermittent problems (occasional crashes and/or corrupted frames) with our software on Maxwell-based GeForce GPUs (980, 980 Ti and TITAN X) with all recent driver versions. Every occurrence of these problems has coincided with a burst of Windows event log messages from the NVIDIA driver (usually three or more messages, always including TEX NACK / Page Fault). Here is one of those bursts:
Provider: nvlddmkm
EventID: 13
\Device\Video7
Graphics Exception: ILLEGAL_OPCODE
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x404490=0x80000004
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: EXTRA_MACRO_DATA
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x404490=0x80000002
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51ca24=0x80000041 0x51ca28=0x180004 0x51ca2c=0xd 0x51ca34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51ca24=0x80000041 0x51ca28=0x180004 0x51ca2c=0xc 0x51ca34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51c224=0x80000041 0x51c228=0x180001 0x51c22c=0x0 0x51c234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 0): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51c224=0x80000041 0x51c228=0x180001 0x51c22c=0x0 0x51c234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 0): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51da24=0x80000041 0x51da28=0x180001 0x51da2c=0x0 0x51da34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 3): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51da24=0x80000041 0x51da28=0x180001 0x51da2c=0x0 0x51da34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 3): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51d224=0x80000041 0x51d228=0x180001 0x51d22c=0x0 0x51d234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 2): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
Graphics Exception: ESR 0x51d224=0x80000041 0x51d228=0x180004 0x51d22c=0xc 0x51d234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 2): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000
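For anyone comparing several of these bursts, a small tally of the TEX faults per GPC/TPC pair may help. The following is only a sketch: the function name countTexFaults is made up for illustration, and the message format is assumed to be exactly the one quoted in the burst above.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <utility>

// Counts "TEX NACK / Page Fault" messages per (GPC, TPC) pair in a block of
// event log text. Hypothetical helper; the pattern is taken from the burst
// quoted above and is not a documented driver log format.
std::map<std::pair<int, int>, int> countTexFaults(const std::string& logText) {
    static const std::regex pattern(
        R"(Graphics TEX Exception on \(GPC (\d+), TPC (\d+)\): TEX NACK / Page Fault)");
    std::map<std::pair<int, int>, int> counts;
    std::istringstream lines(logText);
    std::string line;
    while (std::getline(lines, line)) {
        std::smatch match;
        if (std::regex_search(line, match, pattern)) {
            // match[1] is the GPC index, match[2] is the TPC index.
            ++counts[{std::stoi(match[1].str()), std::stoi(match[2].str())}];
        }
    }
    return counts;
}
```

Applied to the burst above, this would count two faults on each of TPC 0, 1, 2 and 3 of GPC 3, and none on any other GPC.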
A brief Google search shows that similar event log messages (for example TEX NACK / Page Fault) have recently been reported in connection with various NVIDIA driver problems.
Recently we found that adding an extra glFlush() call before every shader change in our main draw loop prevents these errors. So it seems possible that the internal state of the OpenGL implementation is occasionally corrupted (perhaps a timing-related issue causes memory corruption?) when draw commands using multiple shaders are issued between two consecutive glFlush() calls. At the moment we do not know how shader complexity, the resources used by a shader (textures, various buffer objects etc.) or similar variables affect the triggering of these events.
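The workaround described above could be sketched roughly as follows. This is not the actual draw loop: the class name FlushingShaderBinder and the callback-based design are assumptions made so that the example compiles without an OpenGL context. In a real renderer the two callbacks would simply call glFlush() and glUseProgram().

```cpp
#include <functional>
#include <utility>

// Hypothetical helper: routes every shader program change through one place
// and issues a flush before the change. The callbacks stand in for
// glFlush() and glUseProgram(program).
class FlushingShaderBinder {
public:
    FlushingShaderBinder(std::function<void()> flush,
                         std::function<void(unsigned)> useProgram)
        : flush_(std::move(flush)), useProgram_(std::move(useProgram)) {}

    // Switches to `program`. Does nothing if it is already bound, so the
    // extra flush is only paid for on an actual shader change.
    void bind(unsigned program) {
        if (hasCurrent_ && program == current_) {
            return;
        }
        flush_();              // workaround: submit pending commands first
        useProgram_(program);
        current_ = program;
        hasCurrent_ = true;
    }

private:
    std::function<void()> flush_;
    std::function<void(unsigned)> useProgram_;
    unsigned current_ = 0;
    bool hasCurrent_ = false;
};
```

With a current context, the binder might be constructed as FlushingShaderBinder binder([] { glFlush(); }, [](unsigned p) { glUseProgram(p); }); and the draw loop would call binder.bind(program) instead of glUseProgram directly. Keeping the flush in one place also makes it easy to remove again once the underlying cause is understood.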
In our case, every crash has occurred inside the glFlush implementation (or inside the flush implied by an OpenGL context change). Here are the call stack, disassembly, CPU registers and partial memory dumps from a single crash (driver 353.62):
nvoglv64.dll loaded at 000000006BD40000-000000006DA82000
Call stack:
nvoglv64.dll!000000006c7481c0()
nvoglv64.dll!000000006c8dbe10()
nvoglv64.dll!000000006c83aa79()
nvoglv64.dll!000000006c83b1fc()
nvoglv64.dll!000000006c84e78c()
nvoglv64.dll!000000006c71c3dc()
nvoglv64.dll!000000006c81c5ab()
nvoglv64.dll!000000006c7e9140()
nvoglv64.dll!000000006c44654d()
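Since the DLL may load at a different base address on another run or machine, the absolute addresses above are easier to compare when expressed as offsets from the module base. A minimal sketch of that arithmetic, using the load range quoted above (the function names are made up for illustration):

```cpp
#include <cstdint>

// Load range of nvoglv64.dll as reported in this crash report.
constexpr std::uint64_t kModuleBase = 0x000000006BD40000ULL;
constexpr std::uint64_t kModuleEnd  = 0x000000006DA82000ULL;

// Returns true if `address` lies inside the module's load range.
constexpr bool inModule(std::uint64_t address) {
    return address >= kModuleBase && address < kModuleEnd;
}

// Converts an absolute address into an offset from the module base.
constexpr std::uint64_t moduleOffset(std::uint64_t address) {
    return address - kModuleBase;
}
```

For example, the topmost frame 000000006c7481c0 lies inside the module and corresponds to offset 0xA081C0 in nvoglv64.dll for driver 353.62.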
Disassembly from the crash point:
000000006C748160 48 89 5C 24 08 mov qword ptr [rsp+8],rbx
000000006C748165 48 8B 99 38 7C 0C 00 mov rbx,qword ptr [rcx+0C7C38h]
000000006C74816C 8B 42 20 mov eax,dword ptr [rdx+20h]
000000006C74816F 4C 8B C9 mov r9,rcx
000000006C748172 8B 89 30 7C 0C 00 mov ecx,dword ptr [rcx+0C7C30h]
000000006C748178 45 0F B6 D8 movzx r11d,r8b
000000006C74817C 3B C1 cmp eax,ecx
000000006C74817E 73 10 jae 000000006C748190
000000006C748180 48 03 C0 add rax,rax
000000006C748183 48 39 14 C3 cmp qword ptr [rbx+rax*8],rdx
000000006C748187 75 07 jne 000000006C748190
000000006C748189 44 08 44 C3 08 or byte ptr [rbx+rax*8+8],r8b
000000006C74818E EB 1C jmp 000000006C7481AC
000000006C748190 8D 41 01 lea eax,[rcx+1]
000000006C748193 41 89 81 30 7C 0C 00 mov dword ptr [r9+0C7C30h],eax
000000006C74819A 48 8B C1 mov rax,rcx
000000006C74819D 48 03 C0 add rax,rax
000000006C7481A0 48 89 14 C3 mov qword ptr [rbx+rax*8],rdx
000000006C7481A4 89 4A 20 mov dword ptr [rdx+20h],ecx
000000006C7481A7 44 88 5C C3 08 mov byte ptr [rbx+rax*8+8],r11b
000000006C7481AC 48 8B 4A 18 mov rcx,qword ptr [rdx+18h]
000000006C7481B0 48 85 C9 test rcx,rcx
000000006C7481B3 74 59 je 000000006C74820E
000000006C7481B5 48 8B 41 08 mov rax,qword ptr [rcx+8]
000000006C7481B9 48 3B C1 cmp rax,rcx
000000006C7481BC 74 50 je 000000006C74820E
000000006C7481BE 66 90 xchg ax,ax
(**) 000000006C7481C0 48 8B 10 mov rdx,qword ptr [rax]
000000006C7481C3 45 8B 91 30 7C 0C 00 mov r10d,dword ptr [r9+0C7C30h]
000000006C7481CA 4C 8B 42 08 mov r8,qword ptr [rdx+8]
000000006C7481CE 41 8B 50 20 mov edx,dword ptr [r8+20h]
000000006C7481D2 41 3B D2 cmp edx,r10d
000000006C7481D5 73 10 jae 000000006C7481E7
000000006C7481D7 48 03 D2 add rdx,rdx
000000006C7481DA 4C 39 04 D3 cmp qword ptr [rbx+rdx*8],r8
000000006C7481DE 75 07 jne 000000006C7481E7
000000006C7481E0 44 08 5C D3 08 or byte ptr [rbx+rdx*8+8],r11b
000000006C7481E5 EB 1E jmp 000000006C748205
000000006C7481E7 41 8D 52 01 lea edx,[r10+1]
000000006C7481EB 41 89 91 30 7C 0C 00 mov dword ptr [r9+0C7C30h],edx
000000006C7481F2 49 8B D2 mov rdx,r10
000000006C7481F5 48 03 D2 add rdx,rdx
000000006C7481F8 4C 89 04 D3 mov qword ptr [rbx+rdx*8],r8
000000006C7481FC 45 89 50 20 mov dword ptr [r8+20h],r10d
000000006C748200 44 88 5C D3 08 mov byte ptr [rbx+rdx*8+8],r11b
000000006C748205 48 8B 40 08 mov rax,qword ptr [rax+8]
000000006C748209 48 3B C1 cmp rax,rcx
000000006C74820C 75 B2 jne 000000006C7481C0
000000006C74820E 41 8B 81 80 7D 0C 00 mov eax,dword ptr [r9+0C7D80h]
000000006C748215 41 39 81 30 7C 0C 00 cmp dword ptr [r9+0C7C30h],eax
000000006C74821C 7C 08 jl 000000006C748226
000000006C74821E 49 8B 41 68 mov rax,qword ptr [r9+68h]
000000006C748222 49 89 41 70 mov qword ptr [r9+70h],rax
000000006C748226 48 8B 5C 24 08 mov rbx,qword ptr [rsp+8]
000000006C74822B C3 ret
(**) The crash happens here, usually because rax is zero. In some crashes rax is valid and the access violation occurs at 000000006C7481CA because rdx is zero (so the read from 0x0000000000000008 causes the violation). We have noticed that this branch of the function is never executed during normal operation of our software; every time it is executed, it appears to end in an access violation.
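Read as C++, the disassembly above looks roughly like the sketch below. This is only an interpretation: every type and field name is invented, only the offsets in the comments come from the disassembly, the struct layouts are not byte-exact, and the final comparison against [r9+0C7D80h] is left out. The function appears to record an object in a table of 16-byte entries (reusing the slot index stored at object+20h when the entry still points back at the object) and then to do the same for every object reachable through a circular list hanging off object+18h.

```cpp
#include <cstdint>
#include <vector>

struct Object;

struct Link {                 // target of a list node's first pointer
    void*   unknown;          // +00h: not read by this function
    Object* object;           // +08h: "mov r8,qword ptr [rdx+8]"
};

struct Node {                 // node of a circular list with a sentinel head
    Link* link;               // +00h: "mov rdx,qword ptr [rax]"  (**)
    Node* next;               // +08h: "mov rax,qword ptr [rax+8]"
};

struct Object {
    Node*         listHead = nullptr;  // +18h
    std::uint32_t slot = 0;            // +20h: index into the table
};

struct Entry {                // 16-byte table entry at [rbx+index*16]
    Object*      object;
    std::uint8_t flags;
};

struct Context {
    std::vector<Entry> table; // pointer at +0C7C38h, entry count at +0C7C30h
};

// Records `obj` in the table: OR the flags into its existing entry if the
// stored slot is valid, otherwise append a new entry and remember its index.
void mark(Context& ctx, Object* obj, std::uint8_t flags) {
    if (obj->slot < ctx.table.size() && ctx.table[obj->slot].object == obj) {
        ctx.table[obj->slot].flags |= flags;
    } else {
        obj->slot = static_cast<std::uint32_t>(ctx.table.size());
        ctx.table.push_back({obj, flags});
    }
}

// Marks `obj` and then every object linked from its circular list.
void markWithDependents(Context& ctx, Object* obj, std::uint8_t flags) {
    mark(ctx, obj, flags);
    Node* head = obj->listHead;
    if (head == nullptr) {
        return;
    }
    // Like the driver code, this loop has no null checks. The reported crash
    // matches the first read of node->link with node == nullptr.
    for (Node* node = head->next; node != head; node = node->next) {
        mark(ctx, node->link->object, flags);
    }
}
```

Under this reading, the register and memory dumps below are consistent with the list head being invalid rather than merely empty: the pointer stored at rdx+18h (address 0x000000000B650048) equals RCX, the count at r9+0C7C30h (0x18) is one more than the slot index stored at rdx+20h (0x17), and the quadword at rcx+8 (address 0x0000000037F76718) is zero, which would explain RAX being zero at (**). An empty circular list would be expected to point back at its own head instead.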
CPU registers:
RAX = 0000000000000000 RBX = 00000000361CB320 RCX = 0000000037F76710
RDX = 000000000B650030 RSI = 000000000B6904F0 RDI = 0000000000000034
R8 = 0000000000000000 R9 = 000000000B6904F0 R10 = 000000000B6904F0
R11 = 0000000000000000 R12 = 0000000019E57B30 R13 = 0000000000000000
R14 = 00000000000004A0 R15 = 000000000B758220 RIP = 000000006C7481C0
RSP = 000000000A9EE0B8 RBP = 0000000000000010 EFL = 00010285
Memory around rcx (0000000037F76710):
(There seems to be some sort of recurring structure in memory. Only one whole structure is captured here)
0x0000000037F76698 0000000000000000 0000000037f76698 0000000037f76698
0x0000000037F766B0 0000000000000000 0000000000000000 0000000000000000
0x0000000037F766C8 000000000023e8fd 0000000000000000 0000000000000000
0x0000000037F766E0 00000000000005e0 0000000000000000 0000000000000000
0x0000000037F766F8 0000000000000000 0000000038919110 9000007b3aba241d
0x0000000037F76710 0000000000000001 0000000000000000 000000006d8c2080
0x0000000037F76728 0000000000053400 0000000000000000 0000000408121210
0x0000000037F76740 0000000000000000 000000000b2a0000 0000010000030009
0x0000000037F76758 0000000000000000 00000000fff40000 0000000000000000
0x0000000037F76770 0000000000000000 0000000000000000 0000000000000000
0x0000000037F76788 0000080000000000 0000000000053400 000088e400000002
0x0000000037F767A0 0000000000000103 0000000000000000 0000000000000000
0x0000000037F767B8 0000000000000000 0000000000000000 0000000000000000
0x0000000037F767D0 0000000000000000 0000000000000000 00000002000041b1
0x0000000037F767E8 0000000000000000 0000000000000000 0000000000000001
0x0000000037F76800 000000000abefc70 0000000000000ee4 0000000000053400
0x0000000037F76818 0000000000000000 0000000000000020 0000000000000000
0x0000000037F76830 0000000000000000 0000000000000000 0000000037f76838
0x0000000037F76848 0000000037f76838 0000000000000000 0000000000000000
Memory around rdx (000000000B650030):
0x000000000B64FFE8 0000000000000000 0000000000000000 0000000000000000
0x000000000B650000 0000000000000000 0000000000000000 c0007f0000000001
0x000000000B650018 0000000000000000 0000000000000000 8000007b3a43b86c
0x000000000B650030 00000000366a01ca 000000001ac718f0 0000000037f75bb0
0x000000000B650048 0000000037f76710 0000000000000017 0000000000000020
0x000000000B650060 0000000006bf0000 000000003899b850 000000003899d590
0x000000000B650078 000000003899ef90 00000000389a1010 000000003899ffd0
0x000000000B650090 000000003899c3b0 00000000389a0310 00000000389a04b0
0x000000000B6500A8 00000000389a0cd0 0000000038861530 0000000038864ad0
0x000000000B6500C0 00000000389a0b30 00000000366a8d70 000000003899eab0
0x000000000B6500D8 00000000366a1ef0 0000000000000000 8800007b3a43b860
0x000000000B6500F0 0000000008da5d30 80009f00c0008e00 0000000000000000
0x000000000B650108 0000000000000000 0000002600000000 0000000000000007
Memory around r9 (000000000B6904F0):
0x000000000B690478 0000000000000000 0000000000000000 0000000000000000
0x000000000B690490 0000000000000000 000fa59805000005 000000000aef6270
0x000000000B6904A8 0000000000150490 0000000000000000 0000000000000000
0x000000000B6904C0 0000000000000000 0000000000000000 0000000000000000
0x000000000B6904D8 0000000000000000 0000000000000000 080fa4b252b04080
0x000000000B6904F0 0000d00400000009 00000000beef0100 0000000000000000
0x000000000B690508 000000000000411e 0000800000008000 0000000400000980
0x000000000B690520 0000000800004000 0000000000000000 000100000000000a
0x000000000B690538 0000000000000002 0000011d00000000 0000001000000014
0x000000000B690550 0000000000000023 0000000034ea3200 0000000034edfeb0
0x000000000B690568 0000000034ea32c4 0000000000000000 0000000000000000
Memory around r9 + 0C7C30h (000000000B758120):
0x000000000B758090 0000000000000000 0000000000000000 0000000000000000
0x000000000B7580A8 0000000000000000 0000000000000000 0000000000000000
0x000000000B7580C0 0000000000000000 0000000000000000 0000000000000000
0x000000000B7580D8 0000000000000000 0000000000000000 0000000000000000
0x000000000B7580F0 0000000000000000 0000000000000000 0000000000000000
0x000000000B758108 0000000000000000 0000000000000000 0000000000000000
0x000000000B758120 0000000000000018 00000000361cb320 000000000af49f00
0x000000000B758138 0000000019e57b30 0000000019e5b140 0000000000000000
0x000000000B758150 003fc7777401f9a1 0000000000400100 000000000000c000
0x000000000B758168 0000000000000000 0000000000000000 0000000000000000
0x000000000B758180 000c000000000000 0000000000000000 0000000000000000
0x000000000B758198 0000000000000000 0000000000000000 0000000000000000
0x000000000B7581B0 0000000000000000 0000000000000000 0000000000000000
0x000000000B7581C8 0000000000000000 0000000000000000 0000000000000000
0x000000000B7581E0 0001c00000000000 0000000000000000 0000000000000000
0x000000000B7581F8 0000000000000000 0000000000000000 0000000000000000
0x000000000B758210 0000000000000000 0000000000000000 0000000000000b00
0x000000000B758228 0000000000000000 0000000000000000 0000000000000000
0x000000000B758240 0000000000000000 0000000000000000 00000000365092e0
0x000000000B758258 00000000377c5b30 0000000000000000 0000000000000000
We can’t send you a full memory dump or a program that replicates the problem, but sending more partial memory dumps could be an option, if you can specify more closely what data we should be looking for.
Reproducing the problem was easier with the 980 Ti and TITAN X than with the 980. However, crashes occur more often with the 980, while the 980 Ti and TITAN X almost always produce only event log messages (and some garbage flashing on the screen). As a curiosity, if NVIDIA PerfKit was loaded when the error was triggered, the GPU core clock dropped significantly and stayed low (a driver restart was required to recover). We use Windows 7 with a single GPU (no SLI). We tested multiple hardware configurations around the GPU (CPU, motherboard, memory, power supply), and all of them showed these issues.
Are there any known issues that could cause this kind of behaviour? Or is there a way to trigger these driver errors and event log messages by using OpenGL in some improper manner? We have tried to be extra careful when checking our software for possible misuse of OpenGL, but of course it is still possible that we are causing this problem ourselves.