Linux crash

Hi,

I have a Linux system (yet another one :)) which crashes from time to time. I see the following in the /var/log/messages file:

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105120000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: please see the README section on Cache Aliasing for more information

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105121000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105122000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105123000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105124000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105125000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105126000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810105127000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810106190000: actual 0x73 != expected 0x163

Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address 0xffff810106191000: actual 0x73 != expected 0x163

Oct 13 14:05:10 gpu-ws3 syslogd 1.4.1: restart.

Any idea what might be causing this? Could it be related to some sort of faulty cudaMalloc call, maybe?

thanks

eyal
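For anyone triaging similar logs, the NVRM lines can be summarized with a short script instead of being read one by one. This is only a minimal sketch and not part of the original post: the name `parse_bad_caching` is made up for illustration, and the regular expression assumes the exact message format quoted above.

```python
import re

# Pattern assumed from the log lines quoted in this thread.
BAD_CACHING = re.compile(
    r"NVRM: bad caching on address (0x[0-9a-fA-F]+): "
    r"actual (0x[0-9a-fA-F]+) != expected (0x[0-9a-fA-F]+)"
)


def parse_bad_caching(lines):
    """Return (address, actual, expected) integer tuples for matching lines."""
    hits = []
    for line in lines:
        match = BAD_CACHING.search(line)
        if match:
            hits.append(tuple(int(group, 16) for group in match.groups()))
    return hits


# Three lines copied from the log above; the README hint line does not match.
SAMPLE = [
    "Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address "
    "0xffff810105120000: actual 0x73 != expected 0x163",
    "Oct 12 21:31:23 gpu-ws3 kernel: NVRM: please see the README section "
    "on Cache Aliasing for more information",
    "Oct 12 21:31:23 gpu-ws3 kernel: NVRM: bad caching on address "
    "0xffff810105121000: actual 0x73 != expected 0x163",
]

if __name__ == "__main__":
    for address, actual, expected in parse_bad_caching(SAMPLE):
        print(f"{address:#x}: actual {actual:#x}, expected {expected:#x}")
```

In the log above, consecutive addresses differ by 0x1000 (4 KiB), so the driver appears to be reporting the mismatch once per page.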

If you follow the advice of the dmesg text and look in the readme file, you will see this:

I am going to go out on a limb and guess you are running Red Hat EL 5.0-5.3 or a clone thereof (like CentOS, Scientific Linux, Rocks Cluster, etc.). These use a very old kernel (circa 2.6.18) with a series of backports for bug fixes and more modern features. It seems that the page/cache fix that appeared in about 2.6.25 isn’t among those backports.
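To check quickly whether a machine is running a kernel older than that fix, the release string can be compared against a threshold. This is a rough sketch under stated assumptions: the 2.6.25 cutoff is only the approximate version mentioned above, and the helper names `parse_kernel_release` and `predates_fix` are invented for this example. A version number alone also says nothing about vendor backports, so treat the result as a hint.

```python
import platform
import re

# Assumption: the approximate upstream version named in this thread.
ASSUMED_FIX_VERSION = (2, 6, 25)


def parse_kernel_release(release):
    """Return the leading numeric part of a release string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing third component (e.g. "5.15") is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def predates_fix(release, fix_version=ASSUMED_FIX_VERSION):
    """True if the release's upstream base version is older than fix_version."""
    return parse_kernel_release(release) < fix_version


if __name__ == "__main__":
    try:
        running = platform.release()  # same string that `uname -r` prints on Linux
        print(running, "predates assumed fix:", predates_fix(running))
    except ValueError as error:
        print(error)
```

A 2.6.18-based release string compares as older than the assumed cutoff, while a 2.6.27-based one does not.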

Yes, indeed, we’re using Red Hat EL 5.2 with 2.6.18. What would you suggest as the best/preferable version for Red Hat?

excuse my lack of IT/system knowledge :)

thanks

eyal

If you are using Red Hat (which I presume means you have a support contract), then you don’t have a choice. Every version of Red Hat 5 uses the same kernel from their own 2.6.18-based tree, with their own backports. You won’t get support if you try anything else.

We needed a more modern kernel (for a number of reasons) on our cluster, so we are running CentOS 5.2 (RHEL 5.2 de-branded and built from source) with a kernel built from the Fedora Core 10 tree, which is based on 2.6.27, IIRC.

If this is really a problem, then I suggest talking to Red Hat support.

Thanks a lot :)

We’ll try the CentOS way… it will probably take some time, but if it brings stability then it’s worth it.

thanks

eyal