Kernel-related problems?


As I already posted in another thread, we are having problems with this server, which is equipped with two Tesla M2090s.

Running on...

Device 0: Tesla M2090

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     2476.2

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3141.6

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     140591.6

[bandwidthTest] test results...


> exiting in 3 seconds: 3...2...1...done!

real    0m8.582s

user    0m0.182s

sys     0m2.401s

If you look at the sys time you will immediately see there is something wrong. Just for comparison, on a fairly old laptop the results are:

real    0m3.986s

user    0m0.640s

sys     0m0.336s

Anything related to CUDA exhibits stalls or slowdowns, which are explained by the abnormally high sys times.
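To make the comparison repeatable across kernels and distributions, something like the following could collect the same real/user/sys numbers that `time` reports. This is only a minimal sketch, assuming a POSIX system with Python 3; the helper name `time_command` is made up for illustration and is not part of any CUDA tooling.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv once and return (real, user, sys) in seconds, like the shell's `time`.

    Hypothetical helper: user/sys are the CPU times of the waited-for child
    process, taken from the difference of os.times() before and after the run.
    """
    before = os.times()
    start = time.perf_counter()
    # Output is discarded; only the timings are of interest here.
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    real = time.perf_counter() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system
```

Calling `time_command(["./bandwidthTest"])` (the path is a placeholder for wherever the SDK sample was built) on each system under test would give figures that can be compared directly; a sys value that is several times the user value, as in the server run above, is the symptom being described.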

We were advised to try an older version of CentOS (5.8), and the problem went away, but now we are stuck with a really old system. We have tried CentOS 6.2 with its standard kernel, CentOS 6.2 with a 3.4 kernel, and Fedora 17, and we always get the same bad results.

Does anybody have a clue how we could track down the problem? I really thought a modern kernel (3.4) would also solve it, but that was not the case.

Thank you very much in advance.

Are you running the driver on the server in persistence mode, to prevent the driver from being unloaded when not in use? On the laptop the driver is in continuous use (presumably because there is a graphical desktop), so it stays loaded. To turn on persistence mode, use nvidia-smi -pm 1.
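One way to confirm the setting on every GPU is to ask nvidia-smi directly. The sketch below assumes the installed nvidia-smi supports the `--query-gpu` option with the `persistence_mode` field (older drivers may not, in which case the plain `nvidia-smi -q` output has to be read by hand); the function names are made up for illustration.

```python
import subprocess


def parse_persistence(csv_text):
    """Map GPU index -> True/False from lines such as '0, Enabled'."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = [field.strip() for field in line.split(",")]
        modes[int(index)] = (mode == "Enabled")
    return modes


def query_persistence():
    """Ask nvidia-smi for the persistence mode of each GPU.

    Assumption: the driver's nvidia-smi accepts --query-gpu with the
    persistence_mode field and prints one 'index, mode' line per GPU.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    return parse_persistence(result.stdout)
```

On a machine with two boards, `query_persistence()` would be expected to return a dictionary with one entry per GPU; any `False` entry means that GPU's driver state can still be torn down between runs.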

Thanks for your reply.

Yes, the driver is in persistence mode. That was one of the first things we tried, and it made a big difference, but the results are still far from normal.