Hi,
I wanted to share some results of my recent tests with multi-GPU systems. These results are, of course, specific to our algorithm and might differ for other people.
I’ve tested two multi-gpu machines (all linux machines):
- Intel machine: 2x quad-core (8 CPU cores total), 3x GTX295 (6 GPUs total).
- AMD machine: 1x quad-core (4 CPU cores total), 4x GTX295 (8 GPUs total).
Below are the results I've measured (per-input times in minutes:seconds, totals in hh:mm:ss):
gpu-ws5 (CUDA 2.2)                                            Factors (1 GPU vs N)
Input #    1 GPU     2 GPU     4 GPU     6 GPU     8 GPU    vs 2   vs 4   vs 6   vs 8
1        08:35:56  04:22:00  02:15:50  01:40:00  01:21:00   1.97   3.80   5.16   6.37
2        08:33:10  04:19:20  02:13:00  01:37:00  01:17:00   1.98   3.86   5.29   6.66
3        08:33:13  04:19:15  02:12:00  01:36:00  01:17:00   1.98   3.89   5.35   6.67
4        08:33:55  04:19:11  02:13:01  01:37:00  01:17:00   1.98   3.86   5.30   6.67
5        08:33:22  04:19:14  02:13:00  01:37:00  01:17:00   1.98   3.86   5.29   6.67
6        08:33:50  04:19:10  02:13:00  01:37:00  01:16:00   1.98   3.86   5.30   6.76
7        08:33:27  04:19:10  02:13:00  01:37:00  01:17:00   1.98   3.86   5.29   6.67
8        08:33:22  04:19:10  02:13:00  01:36:00  01:17:00   1.98   3.86   5.35   6.67
9        08:33:47  04:19:10  02:13:00  01:36:00  01:17:00   1.98   3.86   5.35   6.67
Total    01:17:02  00:38:55  00:19:58  00:14:32  00:11:35   1.98   3.86   5.30   6.65
gpu-ws4 (CUDA 2.2)
Input #    1 GPU     2 GPU     4 GPU     6 GPU
1        08:35:00            02:15:50  01:32:55
2        08:32:10            02:12:00  01:29:50
3
Total    01:38:17  00:38:55  00:19:57  00:13:33
Some observations and analysis of the results:
- As you can see, the results from the two machines were pretty much the same.
- On the AMD machine, the fact that there were more GPUs than CPU cores didn't make any difference.
- Tests were done on CUDA 2.2. I saw a ~10-15% performance boost when moving from 2.0 to 2.2.
- 2 GPUs (one GTX295) run almost twice as fast as 1 GPU (one half of a GTX295).
- 4 GPUs (2 GTX295s) run x3.86 faster than 1 GPU (one half of a GTX295).
- 6 GPUs (3 GTX295s) run x5.3 faster than 1 GPU.
- 8 GPUs (4 GTX295s) run ~x6.66 faster than 1 GPU.
- I think the reason I don't see a fully linear improvement is due to issues not related to the GPUs, but rather to my CPU thread-handling code and the overhead of scheduling short tasks (hi - 8 GPUs run so fast that a task completes in only 1 minute!! :) ).
- I had a big problem in the previous version of the code, where 8 GPUs ran only ~x4-4.5 faster than 1 GPU, and 4 GPUs ran only ~x2.5-3 faster than 1 GPU. After redesigning (and, more importantly, identifying the bottleneck - surprise, surprise - the PCI bus), I moved some serial code from the CPU to the GPU. This change meant copying only ~500 KB of data from device to host (the result of the algorithm) instead of a couple of MBs. This opened up the bottleneck in a very nice way... :)
Hope this helps people :)
All in all - CUDA, GPUs and NVIDIA rock!!!
Here are the full details of the systems:
GPU-ws4:
CPU : 2x Intel Xeon 5420
Chipset : Intel D5400XS
RAM : 16 GB
PCI-E : 4x PCI-E 1.1 x16 (we can use only 3 slots due to physical board limitations)
GPU-ws5:
CPU : AMD Phenom 9850
Chipset : Nvidia nForce 780a
RAM : 8 GB
PCI-E : 2x PCI-E 2.0 x16 and 2x PCI-E 2.0 x8
eyal