We have recently purchased a Supermicro server equipped with two Tesla M2090 GPUs (http://www.supermicro.com/products/system/2u/2026/sys-2026gt-trf.cfm). We had CentOS 6.2 installed and were having terrible performance problems: every CUDA application exhibited an abnormal delay when running, caused by a high "system CPU time" (measured with the time command). We tested everything we could find in these forums and nothing helped (persistence mode, numactl, …). Finally, we tried CentOS 5.8 in order to use a different kernel, and this solved the problem, which means the original problem was a kernel-related issue.
We are running several tests to make sure we are really obtaining the full expected performance and these are the bandwidthTest results:
Running on...
Device 0: Tesla M2090
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3737.1
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3186.3
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 141235.0
which, compared to the results on a development PC (Core i7-2600, Kubuntu 11.10) equipped with a GeForce GTX 560 Ti, seem a bit low:
Device 0: GeForce GTX 560 Ti
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5969.8
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5330.4
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 104888.3
Could anybody confirm whether the host-to-device and device-to-host bandwidth can be considered normal? (I do, of course, get better results on both systems using pinned memory.)
To check whether PCIe is configured correctly, it's better to check the pinned memory bandwidth, which should be in the vicinity of 6 GB/sec. The paged memory bandwidth will depend on the performance of the host system, as it involves an extra copy in system memory (user data <-> pinned DMA buffer). Make sure you control for NUMA issues, because transfer speeds will be affected if system memory and the GPU are attached to different CPUs. I posted some host<->device throughput data for a system with an M2090 recently:
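If it helps to see what the paged-vs-pinned gap actually measures, here is a minimal sketch of the comparison (my own stripped-down timing code, not the SDK bandwidthTest; it assumes device 0 and a 32 MB transfer, like quick mode):

// Minimal sketch: time the same 32 MB host-to-device copy once from a
// malloc'd (pageable) buffer and once from a cudaMallocHost'd (pinned) buffer.
// Compile with nvcc.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static float h2dCopyMs(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t bytes = 32 << 20;             // 32 MB, same size as quick mode
    void *d_buf = 0, *h_pageable = 0, *h_pinned = 0;

    cudaSetDevice(0);                          // assumption: test device 0
    cudaMalloc(&d_buf, bytes);
    h_pageable = malloc(bytes);                // ordinary pageable allocation
    cudaMallocHost(&h_pinned, bytes);          // page-locked (pinned) allocation

    // Warm-up copy so context creation does not distort the first timing.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    float msPageable = h2dCopyMs(d_buf, h_pageable, bytes);
    float msPinned   = h2dCopyMs(d_buf, h_pinned,   bytes);

    printf("Pageable H2D: %8.1f MB/s\n", bytes / 1.0e6 / (msPageable / 1000.0));
    printf("Pinned   H2D: %8.1f MB/s\n", bytes / 1.0e6 / (msPinned   / 1000.0));

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}

The pinned number is essentially a PCIe measurement, while the pageable number also includes the host-side staging into the driver's pinned DMA buffer, which is why it varies so much from machine to machine.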
Thank you very much for your reply. I had in fact already read that post, but since I know paged memory transfers depend on general system performance, I was wondering if someone could give their opinion on whether the performance I was getting was reasonable for the Supermicro server we have.
Regarding pinned memory transfers:
time -p numactl -m 0 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --memory=pinned --device=0
[bandwidthTest] starting...
/home/mlastra/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...
Running on...
Device 0: Tesla M2090
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5786.2
Device to Host Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5914.6
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 141275.9
[bandwidthTest] test results...
PASSED
> exiting in 3 seconds: 3...2...1...done!
real 3.69
user 0.26
sys 0.39
time -p numactl -m 1 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --memory=pinned --device=1
[bandwidthTest] starting...
/home/mlastra/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...
Running on...
Device 1: Tesla M2090
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5732.6
Device to Host Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5912.7
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 141361.7
[bandwidthTest] test results...
PASSED
> exiting in 3 seconds: 3...2...1...done!
real 3.73
user 0.26
sys 0.44
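In case it is useful to anyone tuning the numactl binding: the node each GPU is attached to can be cross-checked by asking the runtime for the card's PCI bus ID and reading the matching sysfs entry. A small sketch (the lowercase normalisation of the bus ID is a guess at how sysfs names the directory on a given kernel):

// Sketch: print each CUDA device's PCI bus ID and the NUMA node Linux
// reports for it, so numactl -m can be pointed at the right node per GPU.
#include <cctype>
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);  // e.g. "0000:02:00.0"

        // sysfs uses lowercase hex in the device directory names.
        for (char *p = busId; *p; ++p)
            *p = (char)tolower((unsigned char)*p);

        char path[128];
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", busId);

        int node = -1;                           // -1 means no NUMA info available
        FILE *f = fopen(path, "r");
        if (f) {
            fscanf(f, "%d", &node);
            fclose(f);
        }
        printf("device %d: PCI %s -> NUMA node %d\n", dev, busId, node);
    }
    return 0;
}

With that mapping in hand, numactl -m (and optionally --cpunodebind) can be pointed at the node that matches each card.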