OpenGL corruption?

Running on AMD64 Gentoo (yes, yes, I know it’s unsupported, but it’s working fine) with the latest CUDA 1.0, though I observed this problem in 0.9 too. Hardware: 8800 GTX

After running any CUDA program, it seems that the OpenGL driver is corrupted. If I attempt to run PyMOL afterwards, everything works fine for a few seconds and then the machine either stalls for a minute or hangs completely. A reboot solves the problem.

Just wondering if anyone else is experiencing this?

Specifically which display driver are you using?

Are you confident that this problem is only present after using CUDA?

Is this problem specific to running PyMOL, or does it reproduce with any OpenGL application (even glxgears)?

Please generate and attach an nvidia-bug-report.log after reproducing this problem.

I’m currently using driver: 100.14.11 (confirmed using glxinfo).
I am 100% confident that this problem occurs after using CUDA.

However, now that I have tried to reproduce the problem, I realize what is most likely causing it: my dumb CUDA code. Just running any old CUDA program doesn’t cause this problem; I need to run one that writes past the end of a global memory array (originally not on purpose, of course). Sometimes the CUDA program runs without errors when this happens, and sometimes I get an unspecified launch failure.
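To illustrate the kind of bug I mean, here is a stripped-down sketch (hypothetical kernel and names, not my actual code) that writes past the end of a global memory array because the launch covers more threads than there are elements:

#include <cuda_runtime.h>

__global__ void scale(float *data, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;   // no bounds check: threads past the end of the array scribble on neighbouring memory
}

int main()
{
    const int n = 1000;                  // not a multiple of the block size
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // 4 blocks x 256 threads = 1024 threads for only 1000 elements,
    // so threads 1000..1023 write past the end of the allocation.
    scale<<<4, 256>>>(d_data, 2.0f);
    cudaThreadSynchronize();             // sometimes reports "unspecified launch failure", sometimes nothing

    cudaFree(d_data);
    return 0;
}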

I checked it out with glxgears after running a CUDA program that wrote past the end of an array, and the system hung completely.

If you still want the bug report, let me know and I can try to cause a failure that doesn’t hang my system completely so I can generate one.

/me is off to add overflow detection to the misbehaving code so this doesn’t happen again :)
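For the record, the “overflow detection” is nothing fancy, just the usual guard in the kernel (again a hypothetical sketch, not my actual code): pass the element count in and have the extra threads in the last block do nothing.

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // extra threads in the last block skip the write
        data[i] *= factor;
}

// Launch with enough blocks to cover n, rounding up:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale<<<blocks, threads>>>(d_data, n, 2.0f);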

If you’re confident that this is a bug in your code, rather than a driver bug, then there’s no need to provide any further information. Thanks.

Update: I managed to reproduce the glxgears crash after one of my CUDA programs ran. This time my CUDA program didn’t write past the end of any arrays (I’m 99.9% confident of this). In the bug report log I generated to post here, I noticed this message:
Aug 1 11:32:26 joaander NVRM: API mismatch: the client has the version 100.14.10, but
Aug 1 11:32:26 joaander NVRM: this kernel module has the version 100.14.11. Please
Aug 1 11:32:26 joaander NVRM: make sure that this kernel module and all NVIDIA driver
Aug 1 11:32:26 joaander NVRM: components have the same version.

Obviously, my driver upgrade from 0.9 to 1.0 didn’t go so well. I uninstalled and reinstalled the driver and now everything seems fine (knock on wood). In fact, another issue I was having went away.

Just recording the final solution here in case anyone else has similar issues.
Edit: I spoke too soon. It wasn’t the driver that needed to be reinstalled; it was the CUDA toolkit that was actually the problem. The error message quoted above showed up every time I ran a CUDA program. After redownloading and reinstalling the CUDA toolkit, I can now run CUDA programs without error messages.