XOpenDisplay takes a long time to return on an Amazon GPU cluster instance

Hi,
I have installed the NVIDIA OpenGL drivers on an Amazon GPU cluster instance with NVIDIA Tesla M2050 GPUs, and I see odd behavior when I run any two apps in quick succession: XOpenDisplay takes a very long time to return. The driver versions I have tried are NVIDIA-Linux-x86_64-319.23.run and NVIDIA-Linux-x86_64-319.32.run.

I am running the X server headless, with an xorg.conf generated by the following command:

sudo nvidia-xconfig -a --no-xinerama --no-dynamic-twinview --use-display-device=None --virtual=1280x1024

which configures two X screens, one for each of the two GPU devices.

Finally, I start the X server using:

sudo /usr/bin/X :0

and then I run any application on any one of the GPUs using:

DISPLAY=:0.0 glxgears

or

DISPLAY=:0.1 glxgears

To test this, I wrote a simple application that does nothing but open and close the X display in quick succession. After the first XOpenDisplay, the next call to XOpenDisplay takes a very long time to return.

The code for this is something like this:

    int i;
    // Run the following multiple times and measure the rate per second.
    for (i = 0; i < count; i++) {
        gDpy = XOpenDisplay(NULL);
        if (!gDpy) {
            printf("Error: couldn't open default X display.\n");
            return;
        }
        XCloseDisplay(gDpy);
    }

...
...

(Additional code invokes the above loop and measures its rate per second.)

I am attaching the nvidia-bug-report log for reference.

Is this a problem with my setup or with the drivers for Tesla? Any help is appreciated.

Regards,
Divick
nvidia-bug-report.log.gz (104 KB)

I also notice that when an application exits, output similar to the following appears on the terminal where I started X. XOpenDisplay succeeds only after this has been printed, that is, after the previous application has exited or closed the display.

12 XSELINUXs still allocated at reset
SCREEN: 0 objects of 344 bytes = 0 total bytes 0 private allocs
COLORMAP: 0 objects of 8 bytes = 0 total bytes 0 private allocs
DEVICE: 0 objects of 104 bytes = 0 total bytes 0 private allocs
CLIENT: 0 objects of 152 bytes = 0 total bytes 0 private allocs
WINDOW: 0 objects of 72 bytes = 0 total bytes 0 private allocs
PIXMAP: 4 objects of 112 bytes = 448 total bytes 0 private allocs
GC: 8 objects of 40 bytes = 320 total bytes 0 private allocs
CURSOR: 0 objects of 16 bytes = 0 total bytes 0 private allocs
DBE_WINDOW: 0 objects of 24 bytes = 0 total bytes 0 private allocs
GLYPH: 0 objects of 48 bytes = 0 total bytes 0 private allocs
PICTURE: 0 objects of 8 bytes = 0 total bytes 0 private allocs
SYNC_FENCE: 0 objects of 8 bytes = 0 total bytes 0 private allocs
TOTAL: 12 objects, 768 bytes, 0 allocs
4 PIXMAPs still allocated at reset
PIXMAP: 4 objects of 112 bytes = 448 total bytes 0 private allocs
GC: 8 objects of 40 bytes = 320 total bytes 0 private allocs
CURSOR: 0 objects of 16 bytes = 0 total bytes 0 private allocs
DBE_WINDOW: 0 objects of 24 bytes = 0 total bytes 0 private allocs
GLYPH: 0 objects of 48 bytes = 0 total bytes 0 private allocs
PICTURE: 0 objects of 8 bytes = 0 total bytes 0 private allocs
SYNC_FENCE: 0 objects of 8 bytes = 0 total bytes 0 private allocs
TOTAL: 12 objects, 768 bytes, 0 allocs
8 GCs still allocated at reset
GC: 8 objects of 40 bytes = 320 total bytes 0 private allocs
CURSOR: 0 objects of 16 bytes = 0 total bytes 0 private allocs
DBE_WINDOW: 0 objects of 24 bytes = 0 total bytes 0 private allocs
GLYPH: 0 objects of 48 bytes = 0 total bytes 0 private allocs
PICTURE: 0 objects of 8 bytes = 0 total bytes 0 private allocs
SYNC_FENCE: 0 objects of 8 bytes = 0 total bytes 0 private allocs
TOTAL: 8 objects, 320 bytes, 0 allocs

Can anyone help?

With the -noreset option to the X server, the issue goes away. I would call this a workaround rather than a resolution, though. Could someone from NVIDIA please confirm whether this is an issue with the NVIDIA drivers or with the Xorg X server?

The X server’s default behavior is to reset when the last client disconnects. That requires re-initializing the GPU, which can take a moment. If you don’t need the server to reset, then passing -noreset is the right option.