Seg. Fault using dlopen with openGL & several Nvidia drivers (319.12) on RedHat, Suse, & Ubu

SUMMARY:
I’m experiencing a very unusual segmentation fault on Linux machines in an OpenGL application I am working on. The main application uses Qt, but I have reproduced the issue in a simple GLUT application. The affected machines use recent NVIDIA drivers that report GL version 4 (see below for more details). On one machine that was previously unaffected, upgrading to the latest drivers induced the crash.

The general (but not necessarily exclusive) method to reproduce the crash is the following:

  1. An OpenGL app establishes a vertex pointer for drawing vertex data (e.g. glVertexPointer()).
  2. The app draws OpenGL vertex data (e.g. glDrawArrays()).
  3. The same application attempts to load (dlopen) and unload (dlclose) an arbitrary shared library (not libGL).
  4. If a segmentation fault does not occur, repeat 1-3 until it does.
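
For reference, step 3 amounts to something like the following minimal C++ sketch. The function name `loadAndUnload` and the `RTLD_NOW | RTLD_LOCAL` flags are assumptions made for illustration and are not necessarily what the attached sample uses. On older glibc versions this may need `-ldl` at link time.

```cpp
#include <dlfcn.h>

#include <cstdio>
#include <string>

// Hypothetical helper: load a shared library and immediately unload it.
// Returns true only if both dlopen() and dlclose() succeed.
bool loadAndUnload(const std::string& path) {
    dlerror();  // clear any stale error message

    // Flags are an assumption; the real application may use different ones.
    void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        const char* err = dlerror();
        std::fprintf(stderr, "dlopen failed: %s\n", err ? err : "unknown error");
        return false;
    }

    if (dlclose(handle) != 0) {
        const char* err = dlerror();
        std::fprintf(stderr, "dlclose failed: %s\n", err ? err : "unknown error");
        return false;
    }
    return true;
}
```

In the reproduction, a call like this would sit between draw calls, which is where the segfault inside dlopen() shows up.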

Notes:

  • The crash occurs during execution of dlopen().
  • The crash does not reproduce when running under strace, gdb, or valgrind.
  • A core dump is produced on the segfault. The resulting stack trace from the GLUT sample app is included below.
  • The crash is irregular (it may occur quickly or may require many attempts).
  • The crash does not manifest when running with older versions of GL (<= 2.*).
  • The crash does not occur if the vertex pointer passed to OpenGL points to non-dynamically allocated memory (i.e. copying the data into a local array and passing that pointer prevents the crash).
  • I've tried changing to the compatibility profile and changing the GL version (using Qt), and the problem persists.
  • On one machine, I found that if I do not have permissions for device files /dev/nvidia*, then GL version 2.1.2 will be reported instead of GL 4+ and the crash does not manifest.
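
To illustrate the "local array" note above, here is a rough C++ sketch of copying heap-allocated vertex data into a fixed-size local array before handing the pointer to OpenGL. The names `kMaxFloats` and `copyToLocalArray` are hypothetical, and the capacity assumes 400 points with two floats each, which may not match the sample's actual vertex layout.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed capacity: 400 points (the sample's default) * 2 floats per point.
constexpr std::size_t kMaxFloats = 400 * 2;

// Copies up to kMaxFloats values from a heap-backed vector into a
// caller-owned fixed-size array. Returns the number of floats copied.
std::size_t copyToLocalArray(const std::vector<float>& heapVerts,
                             float (&local)[kMaxFloats]) {
    const std::size_t n = std::min(heapVerts.size(), kMaxFloats);
    std::copy_n(heapVerts.begin(), n, local);
    return n;
}
```

In the draw path, `local` would then be the pointer passed to glVertexPointer() in place of the heap pointer. This only sidesteps the crash in my testing; it does not explain it.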

SYSTEMS:

The crash has been reproduced on the following systems:

  • Linux: Red Hat 5.9 (Tikanga) (Machine #1)

    • Memory: 16GB
    • Video: nvidia quadro 2000
    • GL/Drivers: 4.3.0 NVIDIA 319.12
  • Linux: Red Hat 5.9 (Tikanga) (Machine #2)

    • Memory: 96GB
    • Video: 2GB nvidia Quadro 4000
    • GL/Drivers: 4.3.0 nvidia 319.23
  • Linux: SUSE Linux Enterprise Desktop 11 (x86_64)

    • Memory: 16GB
    • Video: nvidia quadro 600
    • GL/Drivers: 4.3.0 NVIDIA 319.12
  • Linux: Ubuntu 12.04.1 LTS (64-bit)

    • Memory: 16GB
    • Video: nvidia quadro 600
    • GL/Drivers: 4.2.0 nvidia 304.64

The crash can not be reproduced on the following systems:

  • Linux: Red Hat 5.3 (Tikanga) (64-bit)

    • Memory: 16GB
    • Video: nVidia Quadro FX 3450/4000 SDI
    • GL/Drivers: 1.2 (1.5 Mesa 6.5.1)
  • Linux: CentOS release 6.4

    • Memory: 16GB
    • Video: nvidia quadro 600
    • GL/Drivers: 4.3.0 NVIDIA 319.60

SAMPLE APPLICATION:
To make things easier, I’ve attached a sample application that reproduces the issue using GLUT. My main application uses Qt, but the sample GLUT app seems to show the same behavior. Please note, the sample application is contrived and very simple (and quickly written). In my main application, the crash also occurs for primitives other than GL_LINE_STRIP.

The app contains two main pieces: 1) a simple glut app that draws lines and loads a shared library, and 2) a simple library. I’ve included source code for a “minimal” library that may be used for testing; however, there is nothing special about this library except that it is as minimal as possible. Another library may suffice.

The provided glut application is very simple. Line strips are randomly generated and drawn with a default size of 400 pts. Given a shared library, you may load & unload that library. A combination of these two actions will cause the crash.

I’ve included a README file in the attached sample app zip. Please see that for details on building & running the app.

CORE:
Here is a stacktrace from a typical coredump. Notice the libGL references when dlopen should instead be loading the custom library:

#0 0x000000360247252b in _int_malloc () from /lib64/libc.so.6
#1 0x0000003602473f6e in malloc () from /lib64/libc.so.6
#2 0x0000003602475c28 in realloc () from /lib64/libc.so.6
#3 0x0000003e274abcdb in ?? () from /usr/lib64/libGL.so.1
#4 0x0000003e274a716a in ?? () from /usr/lib64/libGL.so.1
#5 0x0000003e274ae53d in ?? () from /usr/lib64/libGL.so.1
#6 0x0000003e274a6905 in ?? () from /usr/lib64/libGL.so.1
#7 0x0000003602010f69 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#8 0x000000360200d136 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00000036020108bc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x0000003603000f9a in dlopen_doit () from /lib64/libdl.so.2
#11 0x000000360200d136 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#12 0x000000360300150d in _dlerror_run () from /lib64/libdl.so.2
#13 0x0000003603000f11 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#14 0x0000000000401691 in loadLibrary(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#15 0x0000000000401a17 in handleKey(unsigned char, int, int) ()
#16 0x00002b45e19a07ea in glutMainLoopEvent () from /usr/lib64/libglut.so.3
#17 0x00002b45e19a0f4a in glutMainLoop () from /usr/lib64/libglut.so.3
#18 0x0000000000401b1f in main ()

I’ll be attaching the sample app code as well as the nvidia-bug-report.sh log.

EDIT 7/24/2014
I updated the sample application to report/print GL version information to the console on startup. I also updated the readme for clarity. Additionally, I attached a screenshot of the sample app in action. It isn’t pretty, but this should give you an idea of what it is supposed to look like.
nvidia-bug-report.log.gz (105 KB)
glutSampleApp.zip (5.53 KB)
Screenshot.png

Bumping this topic.

I’m really stumped on this. Any suggestions/help will be greatly appreciated. If there is any additional information I can provide, please let me know.

Thanks

I have tried your sample app on CentOS 5.10 with the 340.24 driver and a GT 240, and on CentOS 6.5 with the 331.89 driver and a K20 over VirtualGL.

No segfaults and I’m pressing ‘1’ and ‘2’ until my fingers bleed. :)

Hi chemal! Thanks for taking the time to look at this. Your input is greatly appreciated!

That’s interesting. I could never get our in-house CentOS box to segfault either. I also haven’t had any luck reproducing the problem remotely or on virtual machines. In those cases, I wasn’t able to run with the latest Nvidia drivers though. I haven’t tried VirtualGL. Additionally, I have primarily been testing this on Quadro series GPUs, as that’s what we’ve got in-house. Now, the K20 is one fancy GPU!

If you have the time, here are a few other things to try:

  1. I've updated the sample app to report GL Version on launch. I've re-attached the latest version to my initial post, but you can also add the following snippet of code to main.cpp just before glutMainLoop() near the end of the file (around line 144):
    printf("\nGL Version: %s\n", glGetString(GL_VERSION));
    fflush(0);
    

    What GL Version is reported for you? Older versions of GL (<= 2.*) don’t crash for me. I usually see the crash on GL versions around 4 (e.g. 4.3.0).

  2. Depending on how thoroughly you want to test this bug, you could try drawing lines with more points. This can be achieved by simply updating the const variable "DefaultNumPts" that is located at main.cpp:15. Right now, it's set to 400 pts per line. That number is fairly arbitrary, but, subjectively speaking, it does seem like higher point counts increase the likelihood of a crash. For example, you might try ~1000 pts.
  3. I'm sure you already checked this, but make certain the shared library gets loaded successfully when you press '2'. You should get a prompt in the terminal if it fails to load. When the app crashes for me, I see command-line output like the following:

    user@machine:~/glutSampleApp/tmp> ./glutSampleApp /home/user/glutSampleApp/SimpleLibrary.so
    Library set as /home/user/glutSampleApp/SimpleLibrary.so

    GL Version: 4.3.0 NVIDIA 319.32
    Loaded/Unloaded shared object successfully.
    Loaded/Unloaded shared object successfully.
    Loaded/Unloaded shared object successfully.
    Loaded/Unloaded shared object successfully.
    Segmentation fault
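
If it helps when comparing machines, the major GL version can be pulled out of the reported string with something like the sketch below. The function name `glMajorVersion` is hypothetical, and it only assumes the string starts with "major.minor" as in the outputs quoted in this thread.

```cpp
#include <cstdio>

// Hypothetical helper: extract the major version from a GL_VERSION-style
// string such as "4.3.0 NVIDIA 319.32". Returns -1 if the string is null
// or does not start with "<major>.<minor>".
int glMajorVersion(const char* versionString) {
    if (versionString == nullptr) {
        return -1;
    }
    int major = 0;
    int minor = 0;
    if (std::sscanf(versionString, "%d.%d", &major, &minor) != 2) {
        return -1;
    }
    return major;
}
```

Since glGetString() returns a `const GLubyte*`, the call site would need a cast, e.g. `glMajorVersion(reinterpret_cast<const char*>(glGetString(GL_VERSION)))`. On the machines listed above, a result of 2 or lower corresponds to the cases where I have not seen the crash.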

This has quickly become a bug of nightmares: occurs sporadically on some machines, never on others, disappears with strace, etc. I can assure you, any help or advice is appreciated! :)

I think this is a known issue related to “dlopen”. Details are documented in NVIDIA’s README file.
http://us.download.nvidia.com/XFree86/Linux-x86_64/340.24/README/knownissues.html

I’ve tested your code on debian/unstable 64bit with 340.24 driver.
On my environment, the crash is fixed by just adding “-pthread” to LIBS in Makefile.

(FYI: strictly speaking, I’ve added “-ldl -pthread” to LIBS. “-ldl” is also required to fix the implicit DSO issue.)

Thanks for the feedback, pyopyopyo.

I had seen the known issues related to “dlopen” before, but I had missed the section on pthreads. Thank you for bringing that to my attention.

Unfortunately, I still receive the segfault even with -lpthread in the link line. I’ve tested this using the Ubuntu 12.04.1 and SUSE Linux Enterprise Desktop 11 machines listed in my original post. Specifically, I updated “Makefile” to look like:

LIBS = -ldl -lpthread -lGL -lGLU -lglut

glutCrashApp: main.cpp GLUtility.cpp
	g++ main.cpp GLUtility.cpp $(LIBS) -o glutSampleApp

I cleaned/rebuilt the app, but the segfault still occurs. ldd does report that libpthread.so.0 is required by glutSampleApp. However, it reports that even without explicitly linking -lpthread.
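
As a runtime complement to ldd, the shared objects actually mapped into the process can be listed with dl_iterate_phdr(). The following is only a sketch: the helper names `loadedObjects` and `isLoaded` are hypothetical, and it assumes a glibc-based Linux system. Note that on glibc 2.34 and later, libpthread is merged into libc, so a "libpthread" entry may not appear there at all.

```cpp
#include <link.h>

#include <cstddef>
#include <string>
#include <vector>

namespace {
// Callback for dl_iterate_phdr(): records the path of each loaded object.
// The main executable is reported with an empty name and is skipped.
int collectName(struct dl_phdr_info* info, std::size_t /*size*/, void* data) {
    auto* names = static_cast<std::vector<std::string>*>(data);
    if (info->dlpi_name != nullptr && info->dlpi_name[0] != '\0') {
        names->emplace_back(info->dlpi_name);
    }
    return 0;  // returning 0 continues the iteration
}
}  // namespace

// Returns the paths of all shared objects currently loaded in this process.
std::vector<std::string> loadedObjects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collectName, &names);
    return names;
}

// Returns true if any loaded object's path contains the given substring.
bool isLoaded(const std::string& needle) {
    for (const std::string& name : loadedObjects()) {
        if (name.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

Calling `isLoaded("libpthread")` or `isLoaded("libGL")` before and after the dlopen() call would show whether loading the test library changes what is mapped.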

In the “known issues” article you linked, it says that dlopen() will crash when loading any other library that is linked against libpthread. However, in this case, SimpleLibrary.so is not linked against -lpthread. Should this still be a problem? Also, I’m not dynamically loading libGL; rather, I’m just using -lGL (though, perhaps another library indirectly loads it later).

EDIT
I should also mention that the main application I’m working on, which is where this issue first appeared, does link with -lpthread. Also, I see that pyopyopyo originally said “-pthread” without the ‘l’. I have tried both to the same effect.

OpenGL version is 3.3.0 for the GT 240 and 4.4.0 for the Tesla K20. Increasing DefaultNumPts makes no difference and your test lib loads and unloads continuously -> no segfault.

But from your bug report file:

[   14.015406] vesafb: mode is 1280x1024x32, linelength=5120, pages=0
[   14.015407] vesafb: scrolling: redraw
[   14.015409] vesafb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[   14.018754] vesafb: framebuffer at 0xf1000000, mapped to 0xffffc90006800000, using 5120k, total 5120k
[   14.018848] Console: switching to colour frame buffer device 160x64
...
[   22.327419] NVRM: Your system is not currently configured to drive a VGA console
[   22.327422] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[   22.327423] NVRM: requires the use of a text-mode VGA console. Use of other console
[   22.327425] NVRM: drivers including, but not limited to, vesafb, may result in
[   22.327426] NVRM: corruption and stability problems, and is not supported.

Thanks again Chemal.

From my experience and yours, I’m guessing that CentOS just doesn’t exhibit the crash. Thank you for going the extra mile to test it.

I looked up information on that warning you posted. Here is some of the stuff I found:
http://www.nvnews.net/vbulletin/showthread.php?t=184614
https://devtalk.nvidia.com/default/topic/548475/vga-console-warning-/

In those, people suggest adding the following to the kernel boot parameters:

"video=vesa:off vga=normal"

I added this line and rebooted. Sure enough, the warning disappeared from the dmesg portion of the log (I’ve attached the new nvidia-bug-report.log.gz).

Unfortunately, even with the update, the segfault still occurs when running the sample app. I tested this on the Ubuntu machine listed in my original post.

nvidia-bug-report.log.gz (66.6 KB)

I gave it a try, and I can reproduce your segfault. The number of 2’s I have to hit varies,
but it does segfault eventually (more when building with -Og -ggdb, usually 2 when building with -O2).
Under gdb it runs without any problems :)
(glibc-2.20 git, 340.24 driver on x64).

Thank you for taking the time to test, mlau!

If you don’t mind, a few questions:

  1. When you say "number of 2's I have to hit vary", I assume you are also hitting '1' in between, correct? I have to alternate between '1' and '2' to get the segfault. Just hitting '2' doesn't crash for me.
  2. Did you try building with -pthread as pyopyopyo suggested above? Does that solve the problem for you?
  3. What linux distro are you running?

I was wondering if it would crash on the latest drivers. The latest I have tested are 319.60. Thanks for testing that for me. ;)

Yeah, it doesn’t crash when run under gdb, valgrind, or strace.

No, I always hit the ‘2’s in a row. -O2 takes 2 keypresses, -Og takes 5 or 6 for a segfault.
-pthread and -lpthread are treated the same by gcc, so there should be no difference.
I’m running Gentoo x64 with the latest dev versions of gcc/glibc/binutils/X stack.

One final thing: You’re not properly building your shared lib. It’s C++ and therefore you need g++ for proper linkage.

g++ -fPIC -shared -o SimpleLibrary.so SimpleLibrary.cc

@mlau
That’s strange. Do you draw any lines (pressing ‘1’) before receiving the crash?

@chemal
I tried compiling the library with g++. Unfortunately, I receive the same segfault. Thanks for the correction though.

I am having the same problem in my application, and I can reproduce the problem with your sample app after about 4 key presses.

Red Hat 5.10 Tikanga
Dell Precision T3600 w/ Quadro 600
GL/Driver: 4.40 / 331.49

Have you had any luck resolving this yet?

Seeing this thread pop up again, I gave it another try.
I can no longer reproduce the crash; it now works perfectly fine.
(340.46 driver on x64, latest glibc-git and binutils-git).