64-bit app ok, 32-bit app slow (343.22)

Brief

Symptom:
OpenGL rendering is very slow in 32-bit apps: a 32-bit glxgears gets around 3 fps (sic!). 64-bit apps work fine.

Nvidia drivers:
Tried 340.46 and 343.22; both show the same symptoms.

System:
New system, Haswell-E CPU, Nvidia GeForce GTX 750 Ti GPU. No prior working configuration.

Operating system:
Up-to-date Debian Sid with the distro's 3.16.5 kernel; I also tried a self-built 3.17.1. The NVIDIA drivers are installed from the distro packages, but I verified that the binaries match the ones provided by the NVIDIA installer.

Detail
I was having problems getting 3D to work in Wine, so I narrowed the problem down to a 32-bit glxgears, which gets only 3 fps. It does display the gears, but they move slowly…
(and obviously the 64-bit version of glxgears gets a decent 17 kfps :) )

glxgears seems to use NVIDIA's shared libraries, as seen in the ldd output:

        linux-gate.so.1 (0xf7773000)
        libGLEW.so.1.10 => /usr/lib/i386-linux-gnu/libGLEW.so.1.10 (0xf76d3000)
        libGLU.so.1 => /usr/lib/i386-linux-gnu/libGLU.so.1 (0x4906c000)
        libGL.so.1 => /usr/lib/i386-linux-gnu/libGL.so.1 (0xf75b9000)
        libm.so.6 => /lib/i386-linux-gnu/i686/cmov/libm.so.6 (0xf7573000)
        libX11.so.6 => /usr/lib/i386-linux-gnu/libX11.so.6 (0xf7421000)
        libXext.so.6 => /usr/lib/i386-linux-gnu/libXext.so.6 (0xf740c000)
        libc.so.6 => /lib/i386-linux-gnu/i686/cmov/libc.so.6 (0xf7261000)
        libXmu.so.6 => /usr/lib/i386-linux-gnu/libXmu.so.6 (0x496e6000)
        libXi.so.6 => /usr/lib/i386-linux-gnu/libXi.so.6 (0xf724e000)
        libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xf715c000)
        libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xf713f000)
        libnvidia-tls.so.343.22 => /usr/lib/i386-linux-gnu/tls/libnvidia-tls.so.343.22 (0xf713a000)
        libnvidia-glcore.so.343.22 => /usr/lib/i386-linux-gnu/libnvidia-glcore.so.343.22 (0xf4b2f000)
        libdl.so.2 => /lib/i386-linux-gnu/i686/cmov/libdl.so.2 (0xf4b2a000)
        /lib/ld-linux.so.2 (0xf7774000)
        libxcb.so.1 => /usr/lib/i386-linux-gnu/libxcb.so.1 (0x44fee000)
        libXt.so.6 => /usr/lib/i386-linux-gnu/libXt.so.6 (0x453d3000)
        libXau.so.6 => /usr/lib/i386-linux-gnu/libXau.so.6 (0x4501a000)
        libXdmcp.so.6 => /usr/lib/i386-linux-gnu/libXdmcp.so.6 (0x45012000)
        libSM.so.6 => /usr/lib/i386-linux-gnu/libSM.so.6 (0xf4b1f000)
        libICE.so.6 => /usr/lib/i386-linux-gnu/libICE.so.6 (0xf4b01000)
        libuuid.so.1 => /lib/i386-linux-gnu/libuuid.so.1 (0xf4afb000)
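
One way to double-check which NVIDIA libraries a binary resolves, and from which directory, is to filter the ldd output. This is only a minimal Python sketch and not part of the original report: the helper name nvidia_libs is hypothetical, and the sample lines are copied from the listing above.

```python
import re

# Matches ldd lines of the form "name => /path/to/lib (0xADDRESS)".
LDD_RE = re.compile(r"\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-f]+\)")

def nvidia_libs(ldd_output):
    """Return (name, path) pairs for libraries whose name starts with libnvidia-."""
    libs = []
    for line in ldd_output.splitlines():
        m = LDD_RE.match(line)
        if m and m.group(1).startswith("libnvidia-"):
            libs.append((m.group(1), m.group(2)))
    return libs

# Sample lines taken from the ldd listing in this post.
sample = """\
        libGL.so.1 => /usr/lib/i386-linux-gnu/libGL.so.1 (0xf75b9000)
        libnvidia-tls.so.343.22 => /usr/lib/i386-linux-gnu/tls/libnvidia-tls.so.343.22 (0xf713a000)
        libnvidia-glcore.so.343.22 => /usr/lib/i386-linux-gnu/libnvidia-glcore.so.343.22 (0xf4b2f000)
"""

for name, path in nvidia_libs(sample):
    print(name, "->", path)
```

If the list came back empty for a 32-bit binary, that would suggest its libGL.so.1 is not the NVIDIA one. Here both driver libraries resolve under the i386 directories.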

strace also shows that it opens /dev/nvidiactl and /dev/nvidia0 successfully, and that it is probably submitting data to the GPU, albeit slowly (fd 4 == /dev/nvidiactl):

1414512607.989459 time(NULL)            = 1414512607
1414512607.989495 ioctl(4, 0xc030464e, 0xffd01820) = 0
1414512607.989568 time(NULL)            = 1414512607
1414512607.989603 ioctl(4, 0xc020464f, 0xffd01850) = 0
1414512607.989668 time(NULL)            = 1414512607
1414512607.989703 ioctl(4, 0xc030464e, 0xffd01820) = 0
1414512607.989774 time(NULL)            = 1414512607
1414512607.989808 ioctl(4, 0xc020464f, 0xffd01850) = 0
.. lots of yielding..
1414512608.000679 time(NULL)            = 1414512608
1414512608.000714 ioctl(4, 0xc030464e, 0xffd054d0) = 0
1414512608.000790 time(NULL)            = 1414512608
1414512608.000826 ioctl(4, 0xc020464f, 0xffd05500) = 0
1414512608.000887 time(NULL)            = 1414512608
1414512608.000923 ioctl(4, 0xc030464e, 0xffd054d0) = 0
1414512608.000993 time(NULL)            = 1414512608
1414512608.001028 ioctl(4, 0xc020464f, 0xffd05500) = 0
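
To put numbers on the spacing of these calls, the timestamps can be diffed. The following is a rough Python sketch, assuming the timestamped strace line format shown above; the function name ioctl_gaps is made up for illustration. Decimal is used because a float cannot hold microsecond-level differences exactly at this timestamp magnitude.

```python
import re
from decimal import Decimal

# Timestamp, then ioctl(fd, request, ...) as in the trace above.
IOCTL_RE = re.compile(r"^(\d+\.\d+)\s+ioctl\((\d+),\s*(0x[0-9a-f]+),")

def ioctl_gaps(trace, fd=4):
    """Return the gaps in seconds between consecutive ioctl calls on fd."""
    stamps = []
    for line in trace.splitlines():
        m = IOCTL_RE.match(line.strip())
        if m and int(m.group(2)) == fd:
            stamps.append(Decimal(m.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]

# First eight lines of the trace in this post.
sample = """\
1414512607.989459 time(NULL)            = 1414512607
1414512607.989495 ioctl(4, 0xc030464e, 0xffd01820) = 0
1414512607.989568 time(NULL)            = 1414512607
1414512607.989603 ioctl(4, 0xc020464f, 0xffd01850) = 0
1414512607.989668 time(NULL)            = 1414512607
1414512607.989703 ioctl(4, 0xc030464e, 0xffd01820) = 0
1414512607.989774 time(NULL)            = 1414512607
1414512607.989808 ioctl(4, 0xc020464f, 0xffd01850) = 0
"""

for gap in ioctl_gaps(sample):
    print(int(gap * 1_000_000), "microseconds since previous ioctl")
```

In this excerpt the ioctls on fd 4 are roughly 100 microseconds apart, so the individual calls return quickly. The slowness would then have to come from how many of them happen per frame, not from any single call blocking.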

[UPDATE: thanks tuke81 I can’t seem to locate the attach file button here, so in case the nvidia-bug-report.log.gz is crucial I may have to try harder.]

[UPDATE]
Attached nvidia-bug-report and glxinfo output from 32-bit, which is identical to the 64-bit one (and direct rendering: Yes).
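
For anyone wanting to compare the 32-bit and 64-bit glxinfo output without reading both files in full, the few fields that matter for ruling out software rendering can be pulled out with a small script. This is a hedged Python sketch: the function names are hypothetical, and it assumes the usual "key: value" lines that glxinfo prints (direct rendering, OpenGL vendor string, OpenGL renderer string).

```python
FIELDS = ("direct rendering", "OpenGL vendor string", "OpenGL renderer string")

def glx_summary(glxinfo_text):
    """Extract the selected 'key: value' fields from glxinfo output."""
    found = {}
    for line in glxinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence of each field of interest.
        if sep and key in FIELDS and key not in found:
            found[key] = value.strip()
    return found

def looks_hardware_accelerated(summary):
    """Heuristic only: direct rendering is on and the vendor string mentions NVIDIA."""
    return (summary.get("direct rendering") == "Yes"
            and "NVIDIA" in summary.get("OpenGL vendor string", ""))
```

Usage would be along the lines of looks_hardware_accelerated(glx_summary(open("glxinfo-32-out.txt").read())). A True result only says the NVIDIA GL implementation is in use; it says nothing about how fast it runs.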

[UPDATE 2014-10-30]
ltrace shows that glXSwapBuffers takes >120 ms to complete in the 32-bit glxgears, whereas in the 64-bit version it is almost instant (mostly < 1 ms). Also, the 32-bit glxgears issues ioctls on every glXSwapBuffers call, whereas the 64-bit version does so only rarely.
Attaching ltrace output from both the 32-bit and the 64-bit glxgears (with yielding disabled for a cleaner trace).

__GL_YIELD="NOTHING" ltrace -S -T -n 2 -ttt -o glxgears-64-ltrace-out.txt glxgears
__GL_YIELD="NOTHING" ltrace -S -T -n 2 -ttt -o glxgears-32-ltrace-out.txt ./glxgears # 32-bit
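
The per-call times can be summarized from the resulting trace files with a short script. This is a Python sketch under assumptions, not something taken from the attachments: it assumes that ltrace -T ends each completed call with the elapsed time in angle brackets, and that a call interrupted by nested output is reported on a later "resumed" line which carries the time. The function names are hypothetical.

```python
import re
import statistics

# Assumed shape: "... glXSwapBuffers ... <seconds>" at the end of the line.
# Lines ending in "<unfinished ...>" carry no duration and are skipped.
SWAP_RE = re.compile(r"glXSwapBuffers\b.*<(\d+\.\d+)>\s*$")

def swap_durations(lines):
    """Return glXSwapBuffers call durations in seconds from ltrace -T lines."""
    durations = []
    for line in lines:
        m = SWAP_RE.search(line)
        if m:
            durations.append(float(m.group(1)))
    return durations

def summarize(durations):
    """Return (count, median, maximum), or None when nothing matched."""
    if not durations:
        return None
    return len(durations), statistics.median(durations), max(durations)
```

For example, summarize(swap_durations(open("glxgears-32-ltrace-out.txt"))) gives the call count, median and worst case for the 32-bit run, which can then be compared with the same figures from the 64-bit trace.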

nvidia-bug-report.log.gz (139 KB)
glxinfo-32-out.txt (49.2 KB)
glxgears-32-ltrace-out.txt (1.24 MB)
glxgears-64-ltrace-out.txt.gz (661 KB)

The attach-file paperclip icon appears after you post; hover the mouse over the right corner.

Download the 32-bit mesa-utils package and extract glxinfo from it. Rename it to e.g. glxinfo32 and run it with ./glxinfo32 (you might need to make it executable first: chmod +x glxinfo32).

Thanks, post now updated with attachments.

I forgot to mention that I had already done that, and the output was identical to the 64-bit version. I have added the output as an attachment to the original post for clarity.

And continuing with more debug info:
I ran perf_3.16 record -e instructions:u on both the 32-bit and the 64-bit glxgears. The 64-bit one spends most of its time in kernel space, but the 32-bit one spends most of its time in libnvidia-glcore.so, in a tight loop in a function at _nv005glcore+0x55080 (if I’m reading it correctly; this is my first time using perf annotate…). Attached is an image of the perf annotate hot spot in the 32-bit glxgears.

Going deeper than this is hard without specific knowledge of the driver.
Why is it looping there? I’m hoping somebody at NVidia has the time to look into this.

I am seeing the exact same problem.
Is there any update on this bug?

It works fine and as fast as expected here. I just downloaded the source (http://www.opensource.apple.com/source/X11apps/X11apps-14/glxgears.c – sorry, google’s first hit) and compiled a 32-bit binary on my 64-bit machine:

gcc -O -m32 glxgears.c -lGL -lX11 -lm
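
When comparing results like this, it may be worth confirming that the binary being tested really is 32-bit. A minimal Python sketch of such a check follows; the function names are made up here. It relies only on the ELF identification header, where the byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects.

```python
ELF_MAGIC = b"\x7fELF"
# EI_CLASS values from the ELF identification header (offset 4).
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}

def elf_class(header):
    """Classify an ELF file from its first bytes; return None if it is not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4], "unknown")

def elf_class_of_file(path):
    """Read just enough of the file at path to classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))
```

With the gcc command above, which writes a.out because no -o is given, elf_class_of_file("./a.out") should report "32-bit".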

My guess is that you don’t have the 32-bit parts of the nvidia driver installed properly (or at all?) and are using software rendering with support from the mess you have created yourself.