[BUG] crash in libnvidia-glcore.so.387.34 thread

I just got the following crash while running a Second Life viewer (with the NVIDIA drivers in threaded mode):

0   com.secondlife.indra.viewer	0x115cae1 LLAppViewerLinux::handleSyncCrashTrace() + 209
1   com.secondlife.indra.viewer	0x1929cee default_unix_signal_handler(int, siginfo_t*, void*) + 1198
2   unknown	0x7f60f5b84ed0 /lib64/libpthread.so.0(+0x11ed0) [0x7f60f5b84ed0]
3   unknown	0x7f60d5524fae /usr/lib64/libnvidia-glcore.so.387.34(+0xd46fae) [0x7f60d5524fae]
4   unknown	0x7f60d5543e19 /usr/lib64/libnvidia-glcore.so.387.34(+0xd65e19) [0x7f60d5543e19]
5   unknown	0x7f60d5666d36 /usr/lib64/libnvidia-glcore.so.387.34(+0xe88d36) [0x7f60d5666d36]
6   unknown	0x7f60d566ce5d /usr/lib64/libnvidia-glcore.so.387.34(+0xe8ee5d) [0x7f60d566ce5d]
7   unknown	0x7f60d69b950c /usr/lib64/libGLX_nvidia.so.0(+0xaf50c) [0x7f60d69b950c]
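
For readers who want to work with these frames: the absolute addresses change from run to run, so the useful part is the `(+0x…)` offset inside each library. Below is a minimal Python sketch that pulls the module path and offset out of such lines and derives the load base of each module. The regular expression and the helper names `parse_frame` and `load_base` are illustrative assumptions, not part of the viewer or the driver.

```python
import re

# Frame lines copied from the crash trace above (frames 3 to 7).
FRAMES = [
    "3   unknown\t0x7f60d5524fae /usr/lib64/libnvidia-glcore.so.387.34(+0xd46fae) [0x7f60d5524fae]",
    "4   unknown\t0x7f60d5543e19 /usr/lib64/libnvidia-glcore.so.387.34(+0xd65e19) [0x7f60d5543e19]",
    "5   unknown\t0x7f60d5666d36 /usr/lib64/libnvidia-glcore.so.387.34(+0xe88d36) [0x7f60d5666d36]",
    "6   unknown\t0x7f60d566ce5d /usr/lib64/libnvidia-glcore.so.387.34(+0xe8ee5d) [0x7f60d566ce5d]",
    "7   unknown\t0x7f60d69b950c /usr/lib64/libGLX_nvidia.so.0(+0xaf50c) [0x7f60d69b950c]",
]

# Matches "<module path>(+0x<offset>) [0x<absolute address>]".
FRAME_RE = re.compile(
    r"(?P<module>/\S+)\(\+0x(?P<offset>[0-9a-f]+)\) \[0x(?P<addr>[0-9a-f]+)\]"
)


def parse_frame(line):
    """Return (module, offset, absolute_address), or None if the line does not match."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return m["module"], int(m["offset"], 16), int(m["addr"], 16)


def load_base(line):
    """Return (module, base), where base = absolute address - offset in module."""
    module, offset, addr = parse_frame(line)
    return module, addr - offset


if __name__ == "__main__":
    for frame in FRAMES:
        module, base = load_base(frame)
        print(f"{module}: load base {base:#x}")
```

All four `libnvidia-glcore.so.387.34` frames resolve to the same load base, which is a quick sanity check that the offsets in the trace are consistent with each other.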

I’m also attaching the nvidia-bug-report.log file.
nvidia-bug-report.log.gz (219 KB)

I just encountered the same bug again (exact same stack trace), something I never encountered in the last decade of using SL viewers with earlier NVIDIA driver versions… I think I’m going to downgrade to v384…

Can you reproduce the problem reliably? If so, can you provide the exact steps to follow for us to observe it?
As a workaround, can you try setting __GL_THREADED_OPTIMIZATIONS=0?
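
A minimal sketch of applying this workaround to a single launch, without changing the environment of the whole session. The variable name comes from the post above; the helper names and the viewer command line passed to `launch` are hypothetical and need to be adapted to the actual install.

```python
import os
import subprocess


def env_without_threaded_optimizations(base_env=None):
    """Return a copy of the environment with __GL_THREADED_OPTIMIZATIONS set to 0."""
    env = dict(os.environ if base_env is None else base_env)
    env["__GL_THREADED_OPTIMIZATIONS"] = "0"
    return env


def launch(viewer_cmd):
    """Start the viewer with the workaround applied to this process only.

    viewer_cmd is a hypothetical command line, for example ["./secondlife"].
    """
    return subprocess.run(viewer_cmd, env=env_without_threaded_optimizations())
```

Only the launched process and its children see the variable, so other OpenGL applications keep their usual settings.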

Alas, I did not find any common ground between the two crashes I got, and consequently could not infer a way to reproduce this bug (otherwise, I’d have already reported it, together with a gdb session log).

No way! I’m not going to lose 30% of frame rate… I simply downgraded to the v384 driver for now…

If there is a bug, we would like to fix it, so it would be nice if you could provide us with more information.
Ideally, a reliable way to reproduce the problem would be great. Failing that,
this frame:
2 unknown 0x7f60f5b84ed0 /lib64/libpthread.so.0(+0x11ed0) [0x7f60f5b84ed0]
is unexpected. What’s your glibc version? Can you provide the output of nm -D /lib64/libpthread.so.0, and if possible install a glibc with debug symbols, and addr2line -e /lib64/libpthread.so.0 0x11ed0
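
The two lookups requested above can be scripted so they are easy to rerun against other frames. This is only a sketch: the helper name `symbol_lookup_commands` is made up for illustration, and `addr2line` will only report a useful source location if the library has debug information available.

```python
def symbol_lookup_commands(library, offset):
    """Build the nm and addr2line command lines for one library and offset."""
    return [
        ["nm", "-D", library],
        ["addr2line", "-e", library, hex(offset)],
    ]


if __name__ == "__main__":
    # Library path and offset taken from frame 2 of the crash trace.
    for cmd in symbol_lookup_commands("/lib64/libpthread.so.0", 0x11ED0):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the output for attaching to the thread.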

In addition, is there any way to disable SecondLife’s SIGSEGV handler (I’m assuming it’s a SIGSEGV)? Can you please run under gdb, catching the crash when it happens? I’m interested in the stack trace (to see if it’s similar), but more importantly in the register values (“info registers” in gdb) at the crash point.

Thanks

EDIT: remove useless instruction, and fix broken sentence

I always provide all the information I can gather, however you must understand that I cannot afford losing time on such issues. What matters for me is that I don’t get a crash, so I’d rather (and actually did) downgrade to the last (proven) stable driver version than run hours-long sessions under gdb in the hope of getting the crash to reproduce under it…

Indeed… But like I explained, I did not find any common ground for the two crashes I got and since it can take hours for them to occur in a session (and they won’t even occur in every hours-long session), it could take months before I would figure out what situation, action, or 3D object is causing them…

Your best bet is to look at your sources and, based on the stack trace I provided, find what part of your code is racy, or fails to lock or test whatever mutex… Since the bug occurred only in v387.34 for me (and I’m running every release or beta driver, whichever was last released), I would expect it to have been introduced in a recent commit…

I’m running glibc v2.26. My Linux distro is PCLinuxOS (a rolling release distro and I keep my system up to date with it).

Sure thing. I’m attaching the result to this message.

Alas, PCLinuxOS does not provide debug symbols packages… :-/

Nope, not without modifying the sources and recompiling the viewer…

I could run the viewer under gdb (with the viewer debug symbols loaded), but since the crash happens in a thread initiated from your driver (and without the debug symbols for pthread & glibc), it would provide no additional info, I’m afraid…

phtread_dyn_symbols.txt (12.1 KB)

Please clarify: you see the problem in 387.34, but not in 387.12?
If so, that does narrow down the range on our side indeed.

Yes, running under gdb and catching the crash would actually provide a little more information. (You don’t actually need to disable the signal handler of the application, gdb will see the signal first.) That information is the state of the registers at the crash point.
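
One way to capture that state unattended is to run gdb in batch mode. With gdb’s default signal handling, the debugger stops on SIGSEGV before the signal is delivered to the application’s handler, so the registers are those of the faulting thread at the faulting instruction. The sketch below only builds the command line; the helper name and the viewer command are hypothetical, and it assumes gdb’s default settings for SIGSEGV have not been changed.

```python
def gdb_capture_command(viewer_cmd):
    """Build a gdb batch invocation that dumps crash state when the program stops.

    viewer_cmd is a hypothetical command line, for example ["./secondlife"].
    """
    return [
        "gdb", "-batch",
        "-ex", "run",                   # run until a signal stops the program
        "-ex", "bt",                    # backtrace of the thread that stopped
        "-ex", "info registers",        # register values at the crash point
        "-ex", "thread apply all bt",   # backtraces of every thread
        "--args", *viewer_cmd,
    ]


if __name__ == "__main__":
    print(" ".join(gdb_capture_command(["./secondlife"])))
```

Redirecting the output of that command to a file leaves a log that can be attached here even if the crash only shows up after hours of use.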

Correct…

BUT: I did not run v387.12 for a very long time (perhaps a week or two, IIRC) and since it is a bug that does not show up in every session, I may just have had some good luck…
On the other hand, I ran the various (beta and stable) releases of v384 for a long time (and am running it again right now) and know for sure it is bug-free in this respect.

AFAIK, the viewer crash handler will kick in in the main thread and the registers shown by gdb after the viewer stops will be unrelated to the crash (their values will have been altered by the crash handler, which is actually designed to dump the crash log, cleanly shut down the viewer and just exit the program with a non-zero value)…

If you know a way to do that under gdb while the viewer crash handler is active, let me know and I might give it a try, time permitting.