piglit opengl conformance test seems to really give nvidia drivers grief?

An app’s unit test here was flaky, and I didn’t trust my hardware, so I decided to validate it
by making sure the piglit test suite passed. That opened up quite a can of worms.

On all versions of Ubuntu I’ve tried (12.04 and 16.04),
with all nvidia driver versions I’ve tried (up to 370.28),
on all amd64 hardware I’ve tried, with really old (gt220) and merely old (gtx570) cards,
the open source piglit suite of opengl conformance/regression tests
seems to be really good at crashing the system.

Here’s the commandline I use to run a ~1 hour subset of the tests:
./piglit run -1 -v --dmesg --sync -t texture tests/quick results/quick

So of course when I come here to file a bug report, and follow the instruction to do
startx -- -logverbose 6
piglit doesn’t crash the system :-) Maybe I can rely on that option, or maybe I just got lucky this time. Either way, more tests fail than pass, which isn’t too good.
I’ve ordered a gtx 1060 in hopes nvidia has given its code path more love than these old cards.

Has anybody else tried piglit on nvidia’s drivers? Does your system stay up when you do?

Note to nvidia: the other vendors started paying attention to piglit not too long ago, and their
latest open source drivers survive it well now, and even pass most tests.

Guess I just got lucky; the run with -logverbose 6 did eventually crash the OS.
Here’s the bug report log: http://kegel.com/linux/piglit-nvidia-bug-report.log.gz

The last two lines from piglit were

running: spec/arb_texture_rectangle/texrect_simple_arb_texrect
[3524/4261] skip: 230, pass: 796, dmesg-warn: 1, fail: 2494, dmesg-fail: 2, crash: 1

tail -f /var/log/kern.log showed:

Sep 16 09:58:14 rbb-ubu1604-3 kernel: [ 1475.867942] ext_texture_for[12354]: segfault at 0 ip 00007fcd50ebf640 sp 00007ffe6ec58058 error 4 in libc-2.23.so[7fcd50d5f000+1c0000]

(Note: I had to install xserver-xorg-legacy to get past the x server not being able to
detect my displays, so it ran setuid root just like it used to by default in ubuntu 12.04, I think.)

OK, finally got a chance to run with some more modern kit. This system didn’t die, but there were two segfaults in dmesg.

With a Quadro M4000 (GM204GL), OS Ubuntu 16.04, kernel 4.4.0-38-generic, and driver nvidia-370 (370.28-0ubuntu0~gpu16.04.1):

$ ./piglit summary console results/quick | tail -n 20
spec/oes_texture_storage_multisample_2d_array/preprocessor/enabled-es.geom: fail
spec/oes_texture_storage_multisample_2d_array/preprocessor/enabled-es.tesc: skip
spec/oes_texture_storage_multisample_2d_array/preprocessor/enabled-es.tese: skip
spec/oes_texture_storage_multisample_2d_array/preprocessor/enabled-es.vert: pass
summary:
name: quick
---- ------
pass: 3598
fail: 2939
crash: 2
skip: 1335
timeout: 0
warn: 4
incomplete: 0
dmesg-warn: 1
dmesg-fail: 1
changes: 0
fixes: 0
regressions: 0
total: 7880
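
For comparing runs, it can help to boil that summary down to one number. Here is a rough sketch that reads the "status: count" lines and prints the pass rate among tests that actually ran (skips excluded). The function name pass_rate is made up for illustration, and it assumes the summary lines look like the ones pasted above.

```shell
# pass_rate: read "status: count" lines on stdin (the format shown in the
# summary above) and print the pass rate among non-skipped tests.
# The function name is hypothetical; only the pass/skip/total lines are used.
pass_rate() {
    awk '
        $1 == "pass:"  { pass  = $2 }
        $1 == "skip:"  { skip  = $2 }
        $1 == "total:" { total = $2 }
        END {
            ran = total - skip
            if (ran > 0)
                printf "%d of %d non-skipped tests passed (%.1f%%)\n", pass, ran, 100 * pass / ran
        }'
}
```

Usage would be something like: ./piglit summary console results/quick | pass_rate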

$ ./piglit summary console results/quick | grep crash
spec/ext_texture_array/texsubimage array: crash
spec/ext_texture_format_bgra8888/api-errors: crash

$ dmesg | tail -n 2

[277017.371769] ext_texture_for[32167]: segfault at 0 ip 00007f0463740385 sp 00007fffd0bfd428 error 4 in libc-2.23.so[7f04635f3000+1c0000]
[277104.938959] texsubimage[32755]: segfault at 36ab520 ip 00007fc752b2c5d5 sp 00007ffc6d0de4c0 error 4 in libnvidia-glcore.so.370.28[7fc751a0b000+13bd000]

Next I suppose I’ll try running with __GL_FSAA_MODE=5, which is rumored to have caused “NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context” on a real app recently.

I ran a full ‘quick’ test with __GL_FSAA_MODE=5 on the m4000, and can provide the output
of nvidia-bug-report.sh and the full piglit output on request.

I had to kill two tests that seemed hung / livelocked (and can provide a backtrace for one).

Afterwards, I ran the crashing tests standalone without __GL_FSAA_MODE.
Here are the ones that reproduced crashes and left stuff in dmesg reliably:

The commands

bin/texsubimage array pbo -auto -fbo
bin/texsubimage array -auto -fbo

both reliably crash and put the following in dmesg:

texsubimage[27980]: segfault at 20e0d20 ip 00007fc0492ce5d5 sp 00007ffc97347800 error 4 in libnvidia-glcore.so.370.28[7fc0481ad000+13bd000]

The command

bin/ext_texture_format_bgra8888-api-errors -auto -fbo

reliably crashes and puts the following in dmesg:

ext_texture_for[28147]: segfault at 0 ip 00007f35dc1ca385 sp 00007ffd0b13dda8 error 4 in libc-2.23.so[7f35dc07d000+1c0000]

The commands

bin/glslparsertest ~/piglit/tests/spec/glsl-1.10/compiler/void/void-equal.vert fail 1.10
bin/glslparsertest ~/piglit/tests/spec/glsl-1.10/compiler/void/void-lt.vert fail 1.10

both reliably crash and put the following in dmesg:

glslparsertest[28167]: segfault at 8 ip 00007fad84a0bb1e sp 00007ffe20214850 error 4 in libnvidia-glcore.so.370.28[7fad8463b000+13bd000]

The commands

bin/glslparsertest ~/piglit/tests/spec/glsl-1.10/preprocessor/divide-by-zero.vert fail 1.10
bin/glslparsertest ~/piglit/tests/spec/glsl-1.10/preprocessor/modulus-by-zero.vert fail

reliably crash and put the following in dmesg:

traps: glslparsertest[28205] trap divide error ip:7f871d7db4a3 sp:7ffc1b0064e8 error:0 in libnvidia-glcore.so.370.28[7f871d4ba000+13bd000]
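
The load address changes from run to run, so the raw ip values in the segfault lines are hard to compare directly. Subtracting the mapping base (the first number in the square brackets) from ip gives an offset inside the library that should be stable for a given build. A small sketch of that arithmetic follows; the function name fault_offset is made up here, and it only handles the "segfault at ... ip ... in lib[base+size]" format, not the "traps:" format.

```shell
# fault_offset: given one kernel segfault line, print "library+0xOFFSET",
# where OFFSET = ip - mapping base. Hypothetical helper for illustration.
# Returns 1 if the line is not in the expected "segfault at" format.
fault_offset() {
    # Extract ip, library name, and mapping base as three space-separated fields.
    fields=$(printf '%s\n' "$1" |
        sed -n 's/.* ip \([0-9a-f]*\) .* in \([^[]*\)\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1 \2 \3/p')
    [ -n "$fields" ] || return 1
    set -- $fields
    # Both values are hex without a 0x prefix in the log, so add it here.
    printf '%s+0x%x\n' "$2" "$(( 0x$1 - 0x$3 ))"
}
```

Fed the texsubimage line from the full piglit run and the one from the standalone run, this should print the same offset for both, which would suggest they fault at the same instruction in libnvidia-glcore.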

Hah! Segfaults in dmesg do not mean it’s a kernel problem; plain old userspace segfaults show up there, too. So far I have not seen any kernel problems on the m4000 myself, which is great news.
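
The "error 4" in those segfault lines backs this up. On x86 that number is the page-fault error code, a small bitmask: bit 0 is set for a protection violation (clear means the page was not present), bit 1 is set for a write (clear means a read), and bit 2 is set when the fault came from user mode. A minimal sketch of the decoding, with a made-up function name, and only meant for the "segfault at" lines (the "error:0" in the "traps:" line is a different kind of code):

```shell
# decode_pf_error: print the meaning of the low three bits of an x86
# page-fault error code, as shown after "error" in a segfault line.
# Hypothetical helper; higher bits are ignored.
decode_pf_error() {
    code=$1
    if [ $(( code & 1 )) -ne 0 ]; then cause='protection violation'; else cause='page not present'; fi
    if [ $(( code & 2 )) -ne 0 ]; then access='write'; else access='read'; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode='user mode'; else mode='kernel mode'; fi
    printf '%s, %s, %s\n' "$cause" "$access" "$mode"
}
```

So decode_pf_error 4 reports a user-mode read of an unmapped page, which fits a null or stale pointer dereference in the process rather than anything in the kernel.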

It’d be nice if the nvidia drivers passed the piglit tests, but it’s possible the onus is
on the piglit test suite.