Tesla M2090 performing as expected?

Hi,

We recently purchased a Supermicro server equipped with two Tesla M2090 cards (http://www.supermicro.com/products/system/2u/2026/sys-2026gt-trf.cfm). With CentOS 6.2 installed we had terrible performance problems: every CUDA application showed an abnormal delay when running, caused by a high “system CPU time” (measured with the time command). We tried everything we could find in these forums and nothing helped (persistence mode, numactl, …). Finally, we tested CentOS 5.8 in order to try a different kernel, and this solved the problem, which means the initial problem was a kernel-related issue.

We are running several tests to make sure we are really obtaining the full expected performance and these are the bandwidthTest results:

Running on...

Device 0: Tesla M2090

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3737.1

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3186.3

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     141235.0

which, compared to the results on a development PC (Core i7-2600, Kubuntu 11.10) equipped with a GeForce GTX 560 Ti, seem a bit low:

Device 0: GeForce GTX 560 Ti

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     5969.8

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     5330.4

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     104888.3

Could anybody confirm whether the host-to-device and device-to-host bandwidths can be considered normal? (I do of course get better results on both systems using pinned memory.)

Thank you very much in advance

To check whether PCIe is configured correctly, it’s better to check the pinned memory bandwidth, which should be in the vicinity of 6 GB/sec. The paged memory bandwidth will depend on the performance of the host system, as it involves copying data in system memory (user data <-> pinned DMA buffer). Make sure you control for NUMA issues, because transfer speeds will be affected if system memory and GPU are attached to different CPUs. I posted some host<->device throughput data for a system with an M2090 recently:

Thank you very much for your reply. I had in fact already read that post, but since paged memory transfers depend on general system performance, I was wondering if someone could give an opinion on whether the performance I was getting was reasonable for the Supermicro server we have.
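One rough way to reason about the paged figures is a small Python sketch of the staging copy described in the reply above (user data <-> pinned DMA buffer). It is a deliberately crude model, not how the driver actually works: it assumes the host-side copy and the PCIe transfer run strictly one after the other, whereas a real driver may overlap them. The function names are made up for illustration; the only inputs are the bandwidthTest numbers from this thread.

```python
def staged_copy_bandwidth(pcie_mb_s, host_copy_mb_s):
    """Effective bandwidth if every byte is first copied in host memory
    and then sent over PCIe, with no overlap (assumed serial model)."""
    return 1.0 / (1.0 / pcie_mb_s + 1.0 / host_copy_mb_s)


def implied_host_copy_bandwidth(paged_mb_s, pinned_mb_s):
    """Invert the serial model: the host copy rate that would explain
    a measured paged figure, given the measured pinned figure."""
    if paged_mb_s >= pinned_mb_s:
        raise ValueError("model requires paged bandwidth below pinned bandwidth")
    return 1.0 / (1.0 / paged_mb_s - 1.0 / pinned_mb_s)


if __name__ == "__main__":
    # Host-to-device figures for device 0 as reported in this thread (MB/s).
    paged, pinned = 3737.1, 5786.2
    host_copy = implied_host_copy_bandwidth(paged, pinned)
    print(f"implied host copy rate: {host_copy:.0f} MB/s")
```

Under this model the paged number is always below the pinned one, and the gap says more about host memory copy speed than about the GPU or the PCIe link.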

Regarding pinned memory transfers:

time -p numactl -m 0 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --memory=pinned --device=0

[bandwidthTest] starting...

/home/mlastra/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2090

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Pinned memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     5786.2

Device to Host Bandwidth, 1 Device(s), Pinned memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     5914.6

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     141275.9

[bandwidthTest] test results...

PASSED

> exiting in 3 seconds: 3...2...1...done!

real 3.69

user 0.26

sys 0.39

time -p numactl -m 1 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --memory=pinned --device=1

[bandwidthTest] starting...

/home/mlastra/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 1: Tesla M2090

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Pinned memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     5732.6

Device to Host Bandwidth, 1 Device(s), Pinned memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     5912.7

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     141361.7

[bandwidthTest] test results...

PASSED

> exiting in 3 seconds: 3...2...1...done!

real 3.73

user 0.26

sys 0.44
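To compare runs like the two above without reading the numbers by eye, a short Python sketch can pull the figures out of the bandwidthTest text. It assumes the output layout shown in this thread (a "... Bandwidth" header line followed later by a "size  bandwidth" row); the function names and the embedded sample are illustrative only.

```python
import re

# Section headers and data rows as they appear in the output pasted above.
HEADER = re.compile(r"^(Host to Device|Device to Host|Device to Device) Bandwidth")
ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_bandwidth_test(text):
    """Return {direction: (transfer_size_bytes, bandwidth_mb_s)}."""
    results = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = header.group(1)
            continue
        row = ROW.match(line)
        if row and current is not None:
            results[current] = (int(row.group(1)), float(row.group(2)))
            current = None  # one data row per section in quick mode
    return results


def relative_gap(a, b):
    """Relative difference between two bandwidth figures."""
    return abs(a - b) / max(a, b)


# Excerpt of the device 0 pinned-memory run from this thread.
SAMPLE = """Host to Device Bandwidth, 1 Device(s), Pinned memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5786.2
Device to Host Bandwidth, 1 Device(s), Pinned memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5914.6
real 3.69
"""

if __name__ == "__main__":
    for direction, (size, mb_s) in parse_bandwidth_test(SAMPLE).items():
        print(f"{direction}: {mb_s} MB/s for {size} bytes")
    # Host-to-device, device 0 vs device 1, figures from this thread.
    print(f"H2D gap between devices: {relative_gap(5786.2, 5732.6):.2%}")
```

With the numbers posted here, the two M2090s differ by under 1% in pinned host-to-device bandwidth, so both slots appear to behave the same when each process is bound to the matching NUMA node.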