Possible OpenGL or driver bug

Hi,

We have been experiencing random issues (occasional crashes and/or trashed frames) with our software on Maxwell-based GeForce GPUs (980, 980 Ti and TITAN X) with all recent driver versions. Every occurrence of these issues has been linked to a burst of Windows event log messages from the NVIDIA driver (usually three or more messages, always including TEX NACK / Page Fault). Here is one of those event bursts:


Provider: nvlddmkm
EventID: 13

\Device\Video7
Graphics Exception: ILLEGAL_OPCODE
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x404490=0x80000004
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: EXTRA_MACRO_DATA
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x404490=0x80000002
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51ca24=0x80000041 0x51ca28=0x180004 0x51ca2c=0xd 0x51ca34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51ca24=0x80000041 0x51ca28=0x180004 0x51ca2c=0xc 0x51ca34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51c224=0x80000041 0x51c228=0x180001 0x51c22c=0x0 0x51c234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 0): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51c224=0x80000041 0x51c228=0x180001 0x51c22c=0x0 0x51c234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 0): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51da24=0x80000041 0x51da28=0x180001 0x51da2c=0x0 0x51da34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 3): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51da24=0x80000041 0x51da28=0x180001 0x51da2c=0x0 0x51da34=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 3): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51d224=0x80000041 0x51d228=0x180001 0x51d22c=0x0 0x51d234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 2): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
Graphics Exception: ESR 0x51d224=0x80000041 0x51d228=0x180004 0x51d22c=0xc 0x51d234=0x0
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000

\Device\Video7
NVRM: Graphics TEX Exception on (GPC 3, TPC 2): TEX NACK / Page Fault
0000000002003000000000000D00AAC0000000000000000000000000000000000000000000000000


A brief Google search shows that similar event log entries (for example TEX NACK / Page Fault) have recently been reported in connection with various NVIDIA driver problems.
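
For anyone comparing their own logs against this one, a burst like the one above can be tallied by exception name and by GPC/TPC pair. The following is only a sketch: it assumes the event text has been exported to a plain string in the same layout as shown above, and the name `summarise_burst` is made up for this example (it is not part of any NVIDIA tool).

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
TEX_RE = re.compile(r"Graphics TEX Exception on \(GPC (\d+), TPC (\d+)\)")
# Matches named exceptions such as "Graphics Exception: ILLEGAL_OPCODE",
# but skips the raw "Graphics Exception: ESR 0x...=0x..." register lines.
NAMED_RE = re.compile(r"Graphics Exception: (?!ESR\b)(\w+)")


def summarise_burst(text):
    """Return (named exception counts, TEX fault counts per (gpc, tpc))."""
    named = Counter()
    tex = Counter()
    for line in text.splitlines():
        m = TEX_RE.search(line)
        if m:
            tex[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = NAMED_RE.search(line)
        if m:
            named[m.group(1)] += 1
    return named, tex


# A shortened excerpt of the burst, for illustration only.
sample = """\
Graphics Exception: ILLEGAL_OPCODE
Graphics Exception: ESR 0x404490=0x80000004
Graphics Exception: EXTRA_MACRO_DATA
NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
NVRM: Graphics TEX Exception on (GPC 3, TPC 1): TEX NACK / Page Fault
NVRM: Graphics TEX Exception on (GPC 3, TPC 0): TEX NACK / Page Fault
"""
named, tex = summarise_burst(sample)
print(dict(named))
print(dict(tex))
```

In the full burst above, every TEX NACK / Page Fault message is reported on GPC 3, twice for each of TPC 0 to TPC 3.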

Recently we found out that adding an extra glFlush() call before every shader change in our main draw loop prevents these issues/errors from happening. So it seems possible that the internal state of the OpenGL implementation gets corrupted at random (perhaps a timing-related issue is causing memory corruption?) when draw commands using multiple shaders are issued without a glFlush() call in between. At the moment we do not know how shader complexity, the resources used by a shader (textures, various buffer objects, etc.) or similar variables affect the triggering of these events.
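
The workaround amounts to the pattern below. This is a minimal sketch in Python, not the application's actual draw loop: `draw_batches` and the `(program, mesh)` batch layout are hypothetical, and `use_program`, `flush` and `draw` are placeholders for `glUseProgram`, `glFlush` and the real draw call, passed in as parameters so the example runs without an OpenGL context.

```python
def draw_batches(batches, use_program, flush, draw):
    """Draw (program, mesh) pairs, flushing before every shader change.

    use_program and flush stand in for glUseProgram and glFlush; draw
    stands in for whatever issues the actual draw command.
    """
    current = None
    for program, mesh in batches:
        if program != current:
            flush()               # workaround: submit pending commands first
            use_program(program)  # only then switch to the next shader
            current = program
        draw(mesh)


# Record the call order instead of talking to a real GL context.
calls = []
draw_batches(
    [(1, "a"), (1, "b"), (2, "c")],
    use_program=lambda p: calls.append(f"use {p}"),
    flush=lambda: calls.append("flush"),
    draw=lambda m: calls.append(f"draw {m}"),
)
print(calls)
```

Consecutive draws that share a shader are not separated by a flush, so the extra cost scales with the number of shader changes per frame rather than with the number of draw calls.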

In our case, every crash has taken place somewhere inside the glFlush implementation (or inside the flush implied by an OpenGL context change). Here are the call stack, the disassembly, the CPU registers and partial memory dumps from a single crash event (driver 353.62):


nvoglv64.dll loaded at 000000006BD40000-000000006DA82000

Call stack:

nvoglv64.dll!000000006c7481c0() 	
nvoglv64.dll!000000006c8dbe10() 	
nvoglv64.dll!000000006c83aa79() 	
nvoglv64.dll!000000006c83b1fc() 	
nvoglv64.dll!000000006c84e78c() 	
nvoglv64.dll!000000006c71c3dc() 	
nvoglv64.dll!000000006c81c5ab() 	
nvoglv64.dll!000000006c7e9140() 	
nvoglv64.dll!000000006c44654d()

Disassembly from the crash point:

000000006C748160 48 89 5C 24 08       mov         qword ptr [rsp+8],rbx  
000000006C748165 48 8B 99 38 7C 0C 00 mov         rbx,qword ptr [rcx+0C7C38h]  
000000006C74816C 8B 42 20             mov         eax,dword ptr [rdx+20h]  
000000006C74816F 4C 8B C9             mov         r9,rcx  
000000006C748172 8B 89 30 7C 0C 00    mov         ecx,dword ptr [rcx+0C7C30h]  
000000006C748178 45 0F B6 D8          movzx       r11d,r8b  
000000006C74817C 3B C1                cmp         eax,ecx  
000000006C74817E 73 10                jae         000000006C748190  
000000006C748180 48 03 C0             add         rax,rax  
000000006C748183 48 39 14 C3          cmp         qword ptr [rbx+rax*8],rdx  
000000006C748187 75 07                jne         000000006C748190  
000000006C748189 44 08 44 C3 08       or          byte ptr [rbx+rax*8+8],r8b  
000000006C74818E EB 1C                jmp         000000006C7481AC  
000000006C748190 8D 41 01             lea         eax,[rcx+1]  
000000006C748193 41 89 81 30 7C 0C 00 mov         dword ptr [r9+0C7C30h],eax  
000000006C74819A 48 8B C1             mov         rax,rcx  
000000006C74819D 48 03 C0             add         rax,rax  
000000006C7481A0 48 89 14 C3          mov         qword ptr [rbx+rax*8],rdx  
000000006C7481A4 89 4A 20             mov         dword ptr [rdx+20h],ecx  
000000006C7481A7 44 88 5C C3 08       mov         byte ptr [rbx+rax*8+8],r11b  
000000006C7481AC 48 8B 4A 18          mov         rcx,qword ptr [rdx+18h]  
000000006C7481B0 48 85 C9             test        rcx,rcx  
000000006C7481B3 74 59                je          000000006C74820E  
000000006C7481B5 48 8B 41 08          mov         rax,qword ptr [rcx+8]  
000000006C7481B9 48 3B C1             cmp         rax,rcx  
000000006C7481BC 74 50                je          000000006C74820E  
000000006C7481BE 66 90                xchg        ax,ax
(**) 000000006C7481C0 48 8B 10             mov         rdx,qword ptr [rax]
000000006C7481C3 45 8B 91 30 7C 0C 00 mov         r10d,dword ptr [r9+0C7C30h]  
000000006C7481CA 4C 8B 42 08          mov         r8,qword ptr [rdx+8]  
000000006C7481CE 41 8B 50 20          mov         edx,dword ptr [r8+20h]  
000000006C7481D2 41 3B D2             cmp         edx,r10d  
000000006C7481D5 73 10                jae         000000006C7481E7  
000000006C7481D7 48 03 D2             add         rdx,rdx  
000000006C7481DA 4C 39 04 D3          cmp         qword ptr [rbx+rdx*8],r8  
000000006C7481DE 75 07                jne         000000006C7481E7  
000000006C7481E0 44 08 5C D3 08       or          byte ptr [rbx+rdx*8+8],r11b  
000000006C7481E5 EB 1E                jmp         000000006C748205  
000000006C7481E7 41 8D 52 01          lea         edx,[r10+1]  
000000006C7481EB 41 89 91 30 7C 0C 00 mov         dword ptr [r9+0C7C30h],edx  
000000006C7481F2 49 8B D2             mov         rdx,r10  
000000006C7481F5 48 03 D2             add         rdx,rdx  
000000006C7481F8 4C 89 04 D3          mov         qword ptr [rbx+rdx*8],r8  
000000006C7481FC 45 89 50 20          mov         dword ptr [r8+20h],r10d  
000000006C748200 44 88 5C D3 08       mov         byte ptr [rbx+rdx*8+8],r11b  
000000006C748205 48 8B 40 08          mov         rax,qword ptr [rax+8]  
000000006C748209 48 3B C1             cmp         rax,rcx  
000000006C74820C 75 B2                jne         000000006C7481C0  
000000006C74820E 41 8B 81 80 7D 0C 00 mov         eax,dword ptr [r9+0C7D80h]  
000000006C748215 41 39 81 30 7C 0C 00 cmp         dword ptr [r9+0C7C30h],eax  
000000006C74821C 7C 08                jl          000000006C748226  
000000006C74821E 49 8B 41 68          mov         rax,qword ptr [r9+68h]  
000000006C748222 49 89 41 70          mov         qword ptr [r9+70h],rax  
000000006C748226 48 8B 5C 24 08       mov         rbx,qword ptr [rsp+8]  
000000006C74822B C3                   ret

(**) The crash happens here, usually because rax is zero. In some crash occurrences rax is valid and the access violation takes place at 000000006C7481CA because rdx is zero (so the read from 0x0000000000000008 causes the violation). We have noticed that this branch of the function is never executed during normal execution of our software. Every execution of this branch seems to lead to an access violation crash.
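
One possible reading of the listing, offered as an interpretation of the disassembly and not as knowledge of the driver source: the code from 000000006C7481B5 to 000000006C74820C looks like a walk over a circular linked list whose head is in rcx, with a per-node pointer at offset 0 and a next pointer at offset 8. The Python model below mirrors that control flow; `Node`, `walk` and `visit` are hypothetical names, and the table update in the loop body is reduced to a callback.

```python
class Node:
    """One list node as the loop seems to use it: item at +0, next at +8."""

    def __init__(self, item=None):
        self.item = item   # [node+0]
        self.next = None   # [node+8]


def walk(head, visit):
    """Model of the loop between 6C7481B5 and 6C74820C."""
    if head is None:                 # test rcx,rcx / je
        return
    node = head.next                 # mov rax,[rcx+8]
    while node is not head:          # cmp rax,rcx
        if node is None:
            # mov rdx,[rax] with rax == 0
            raise MemoryError("read from 0x0 at 6C7481C0")
        if node.item is None:
            # mov r8,[rdx+8] with rdx == 0
            raise MemoryError("read from 0x8 at 6C7481CA")
        visit(node.item)             # loop body: register the item
        node = node.next             # mov rax,[rax+8]


# A well-formed empty list links back to itself, so the body is skipped.
empty = Node()
empty.next = empty
seen = []
walk(empty, seen.append)

# A head whose next pointer is null reproduces the first crash variant.
broken = Node()
try:
    walk(broken, seen.append)
except MemoryError as exc:
    print(exc)
```

Under this reading, the branch would be skipped whenever rcx is null or the list is empty, and both observed faults would correspond to a null pointer where a list link or a node payload was expected.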

CPU registers:

RAX = 0000000000000000 RBX = 00000000361CB320 RCX = 0000000037F76710 
RDX = 000000000B650030 RSI = 000000000B6904F0 RDI = 0000000000000034 
R8  = 0000000000000000 R9  = 000000000B6904F0 R10 = 000000000B6904F0 
R11 = 0000000000000000 R12 = 0000000019E57B30 R13 = 0000000000000000 
R14 = 00000000000004A0 R15 = 000000000B758220 RIP = 000000006C7481C0 
RSP = 000000000A9EE0B8 RBP = 0000000000000010 EFL = 00010285

Memory around rcx (0000000037F76710):
(There seems to be some sort of recurring structure in memory. Only one whole structure is captured here)

0x0000000037F76698  0000000000000000 0000000037f76698 0000000037f76698 
0x0000000037F766B0  0000000000000000 0000000000000000 0000000000000000 
0x0000000037F766C8  000000000023e8fd 0000000000000000 0000000000000000 
0x0000000037F766E0  00000000000005e0 0000000000000000 0000000000000000 
0x0000000037F766F8  0000000000000000 0000000038919110 9000007b3aba241d 
0x0000000037F76710  0000000000000001 0000000000000000 000000006d8c2080 
0x0000000037F76728  0000000000053400 0000000000000000 0000000408121210 
0x0000000037F76740  0000000000000000 000000000b2a0000 0000010000030009 
0x0000000037F76758  0000000000000000 00000000fff40000 0000000000000000 
0x0000000037F76770  0000000000000000 0000000000000000 0000000000000000 
0x0000000037F76788  0000080000000000 0000000000053400 000088e400000002 
0x0000000037F767A0  0000000000000103 0000000000000000 0000000000000000 
0x0000000037F767B8  0000000000000000 0000000000000000 0000000000000000 
0x0000000037F767D0  0000000000000000 0000000000000000 00000002000041b1 
0x0000000037F767E8  0000000000000000 0000000000000000 0000000000000001 
0x0000000037F76800  000000000abefc70 0000000000000ee4 0000000000053400 
0x0000000037F76818  0000000000000000 0000000000000020 0000000000000000 
0x0000000037F76830  0000000000000000 0000000000000000 0000000037f76838 
0x0000000037F76848  0000000037f76838 0000000000000000 0000000000000000

Memory around rdx (000000000B650030):

0x000000000B64FFE8  0000000000000000 0000000000000000 0000000000000000 
0x000000000B650000  0000000000000000 0000000000000000 c0007f0000000001 
0x000000000B650018  0000000000000000 0000000000000000 8000007b3a43b86c 
0x000000000B650030  00000000366a01ca 000000001ac718f0 0000000037f75bb0 
0x000000000B650048  0000000037f76710 0000000000000017 0000000000000020 
0x000000000B650060  0000000006bf0000 000000003899b850 000000003899d590 
0x000000000B650078  000000003899ef90 00000000389a1010 000000003899ffd0 
0x000000000B650090  000000003899c3b0 00000000389a0310 00000000389a04b0 
0x000000000B6500A8  00000000389a0cd0 0000000038861530 0000000038864ad0 
0x000000000B6500C0  00000000389a0b30 00000000366a8d70 000000003899eab0 
0x000000000B6500D8  00000000366a1ef0 0000000000000000 8800007b3a43b860 
0x000000000B6500F0  0000000008da5d30 80009f00c0008e00 0000000000000000 
0x000000000B650108  0000000000000000 0000002600000000 0000000000000007

Memory around r9 (000000000B6904F0):

0x000000000B690478  0000000000000000 0000000000000000 0000000000000000
0x000000000B690490  0000000000000000 000fa59805000005 000000000aef6270
0x000000000B6904A8  0000000000150490 0000000000000000 0000000000000000
0x000000000B6904C0  0000000000000000 0000000000000000 0000000000000000
0x000000000B6904D8  0000000000000000 0000000000000000 080fa4b252b04080
0x000000000B6904F0  0000d00400000009 00000000beef0100 0000000000000000
0x000000000B690508  000000000000411e 0000800000008000 0000000400000980
0x000000000B690520  0000000800004000 0000000000000000 000100000000000a
0x000000000B690538  0000000000000002 0000011d00000000 0000001000000014
0x000000000B690550  0000000000000023 0000000034ea3200 0000000034edfeb0
0x000000000B690568  0000000034ea32c4 0000000000000000 0000000000000000

Memory around r9 + 0C7C30h (000000000B758120):

0x000000000B758090  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7580A8  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7580C0  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7580D8  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7580F0  0000000000000000 0000000000000000 0000000000000000 
0x000000000B758108  0000000000000000 0000000000000000 0000000000000000 
0x000000000B758120  0000000000000018 00000000361cb320 000000000af49f00 
0x000000000B758138  0000000019e57b30 0000000019e5b140 0000000000000000 
0x000000000B758150  003fc7777401f9a1 0000000000400100 000000000000c000 
0x000000000B758168  0000000000000000 0000000000000000 0000000000000000 
0x000000000B758180  000c000000000000 0000000000000000 0000000000000000 
0x000000000B758198  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7581B0  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7581C8  0000000000000000 0000000000000000 0000000000000000 
0x000000000B7581E0  0001c00000000000 0000000000000000 0000000000000000 
0x000000000B7581F8  0000000000000000 0000000000000000 0000000000000000 
0x000000000B758210  0000000000000000 0000000000000000 0000000000000b00 
0x000000000B758228  0000000000000000 0000000000000000 0000000000000000 
0x000000000B758240  0000000000000000 0000000000000000 00000000365092e0 
0x000000000B758258  00000000377c5b30 0000000000000000 0000000000000000
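
The dumps above can be cross-checked against the registers and the disassembly with a small script. The sketch below assumes the three-qwords-per-line layout used in this post; `parse_dump` is a made-up helper name, and only the four dump lines needed for the check are repeated.

```python
def parse_dump(text):
    """Map each 8-byte-aligned address in a dump listing to its qword value."""
    memory = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].lower().startswith("0x"):
            continue  # skip blank lines and free-form comments
        base = int(fields[0], 16)
        for i, word in enumerate(fields[1:]):
            memory[base + 8 * i] = int(word, 16)
    return memory


# Lines copied from the dumps around rcx, rdx and r9 + 0C7C30h.
dump = """\
0x0000000037F76710  0000000000000001 0000000000000000 000000006d8c2080
0x000000000B650030  00000000366a01ca 000000001ac718f0 0000000037f75bb0
0x000000000B650048  0000000037f76710 0000000000000017 0000000000000020
0x000000000B758120  0000000000000018 00000000361cb320 000000000af49f00
"""
mem = parse_dump(dump)
RCX, RDX, R9 = 0x37F76710, 0x0B650030, 0x0B6904F0

print(hex(mem[RDX + 0x18]))    # value loaded by mov rcx,[rdx+18h]
print(hex(mem[RCX + 0x08]))    # value loaded by mov rax,[rcx+8]
print(hex(mem[R9 + 0xC7C38]))  # value loaded by mov rbx,[rcx+0C7C38h]
```

With the values posted here, [rdx+18h] equals rcx, [r9+0C7C38h] equals rbx, and [rcx+8] is zero, which matches RAX = 0 at the crash point. The dump around rcx also contains two places where a pair of adjacent qwords point back at the qword just before them (0x37F76698 and 0x37F76838); these may be self-linked list heads, and the address in rcx shows no such pattern.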

We can’t send you a full memory dump or a program that replicates the problem, but sending more partial memory dumps could be an option, if you can specify more closely what data we should be looking for.


Reproducing the problem was easier with the 980 Ti and TITAN X than with the 980. However, actual crashes occur more often with the 980, while the 980 Ti and TITAN X almost always produce only event log messages (and some garbage flashing on the screen). As a curiosity, if NVIDIA PerfKit was loaded when the error was triggered, the GPU core clock dropped significantly and permanently (a driver restart was required to recover). The OS we use is Windows 7. Only single-GPU configurations were used (no SLI). We tried multiple hardware configurations (CPU, motherboard, memory, power supply) around the GPU, and all of them showed these issues.

Are there any known issues that could cause this kind of behaviour? Or is there a way to cause this kind of driver error and event log entry by using OpenGL in some improper manner? We have tried to be extra careful when checking our software for possible misuses of OpenGL, but of course it is still possible that we are causing this problem ourselves.