MultiGPU information

Hi,

I wanted to share some results of my recent tests with a multi-GPU system. These results are, of course, specific to our algorithm and may differ for other people.

I’ve tested two multi-GPU machines (both running Linux):

  1. Intel machine - 2x quad-core (8 CPU cores total), 3x GTX 295 (6 GPUs total).

  2. AMD machine - 1x quad-core (4 CPU cores total), 4x GTX 295 (8 GPUs total).

Below are the results I’ve measured (time in minutes):

gpu-ws5 (CUDA 2.2) - time per input, and speedup factors vs. 1 GPU:

Input #    1 GPU      2 GPU      4 GPU      6 GPU      8 GPU      1 vs 2   1 vs 4   1 vs 6   1 vs 8
1          08:35:56   04:22:00   02:15:50   01:40:00   01:21:00   1.97     3.80     5.16     6.37
2          08:33:10   04:19:20   02:13:00   01:37:00   01:17:00   1.98     3.86     5.29     6.66
3          08:33:13   04:19:15   02:12:00   01:36:00   01:17:00   1.98     3.89     5.35     6.67
4          08:33:55   04:19:11   02:13:01   01:37:00   01:17:00   1.98     3.86     5.30     6.67
5          08:33:22   04:19:14   02:13:00   01:37:00   01:17:00   1.98     3.86     5.29     6.67
6          08:33:50   04:19:10   02:13:00   01:37:00   01:16:00   1.98     3.86     5.30     6.76
7          08:33:27   04:19:10   02:13:00   01:37:00   01:17:00   1.98     3.86     5.29     6.67
8          08:33:22   04:19:10   02:13:00   01:36:00   01:17:00   1.98     3.86     5.35     6.67
9          08:33:47   04:19:10   02:13:00   01:36:00   01:17:00   1.98     3.86     5.35     6.67
Total      01:17:02   00:38:55   00:19:58   00:14:32   00:11:35   1.98     3.86     5.30     6.65


gpu-ws4 (CUDA 2.2):

Input #    1 GPU      2 GPU      4 GPU      6 GPU
1          08:35:00              02:15:50   01:32:55
2          08:32:10              02:12:00   01:29:50
Total      01:38:17   00:38:55   00:19:57   00:13:33
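As a sanity check, the speedup columns above are just the 1 GPU time divided by the n GPU time. Here is a minimal Python sketch that reproduces them from the Total row of the first table (reading those times as hh:mm:ss is my interpretation of the spreadsheet formatting):

```python
# Reproduce the speedup columns from the Total row of the gpu-ws5 table.
# Assumption (mine): the Total-row times are formatted as hh:mm:ss.

def to_seconds(t: str) -> int:
    """Parse an hh:mm:ss string into a number of seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return h * 3600 + m * 60 + s

# Total runtimes from the table, keyed by GPU count.
totals = {1: "01:17:02", 2: "00:38:55", 4: "00:19:58",
          6: "00:14:32", 8: "00:11:35"}

base = to_seconds(totals[1])
for gpus, t in totals.items():
    speedup = base / to_seconds(t)
    print(f"{gpus} GPU(s): {speedup:.2f}x")
```

Running this reproduces the 1.98 / 3.86 / 5.30 / 6.65 factors from the Total row.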

Some observations and analysis of the results:

  1. As you can see, the results from the two machines were pretty much the same.

  2. On the AMD machine, the fact that there were more GPUs than CPU cores didn’t make any difference.

  3. Tests were done on CUDA 2.2. I saw a ~10-15% performance boost when moving from CUDA 2.0 to 2.2.

  4. 2 GPUs (that is, one GTX 295) run almost twice as fast as 1 GPU (one half of a GTX 295).

  5. 4 GPUs (2 GTX 295s) run x3.86 faster than 1 GPU (one half of a GTX 295).

  6. 6 GPUs (3 GTX 295s) run x5.3 faster than 1 GPU.

  7. 8 GPUs (4 GTX 295s) run ~x6.66 faster than 1 GPU.

  8. I think the reason I don’t see a fully linear improvement is due to issues not related to the GPUs, but rather to my CPU thread-handling code and
    the overhead of managing short tasks (hey - with 8 GPUs running so fast, a task completes in only 1 minute !! :) ).

  9. I had a big problem in the previous version of the code, where 8 GPUs ran only ~x4-4.5 faster than 1 GPU, and 4 GPUs ran only ~x2.5-3 faster than 1 GPU.
    After redesigning (and, more importantly, identifying the bottleneck - surprise, surprise - the PCI bus), I moved some serial code from the CPU to the GPU.
    This change resulted in copying ~500 KB of data from device to host (the result of the algorithm) instead of a couple of MBs.
    This opened up the bottleneck in a very nice way… :)
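To see why shrinking that copy matters, here is a rough back-of-envelope model of the per-task device-to-host transfer. The ~3 GB/s effective PCIe throughput, the 10 µs fixed per-copy latency, and the 3 MB stand-in for "a couple of MBs" are all my assumptions, not measured values:

```python
# Rough model of the device-to-host copy time before and after the redesign.
# All constants below are assumptions for illustration, not measurements.
EFFECTIVE_PCIE_BW = 3e9  # bytes/s, assumed effective PCIe throughput
FIXED_OVERHEAD = 10e-6   # seconds of assumed per-copy setup latency

def copy_time(nbytes: float) -> float:
    """Estimated seconds for one device-to-host copy of nbytes."""
    return FIXED_OVERHEAD + nbytes / EFFECTIVE_PCIE_BW

before = copy_time(3e6)    # ~3 MB of intermediate data per task
after = copy_time(500e3)   # ~500 KB final result per task
print(f"before: {before * 1e3:.3f} ms  after: {after * 1e3:.3f} ms  "
      f"ratio: {before / after:.1f}x")
```

Under these assumptions each task spends several times less wall-clock time on the bus, which compounds across eight GPUs all sharing the same PCIe lanes.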

Hope this helps people :)

All in all - CUDA, GPUs and NVIDIA rock !!!

Here are the full details of the systems:

GPU-ws4:

CPU : 2x Intel Xeon 5420

Chipset : Intel D5400XS

RAM : 16 GB

PCI-E : 4x PCI-E 1.1 x16 (only 3 of the 4 slots are usable due to physical board limitations)

GPU-ws5:

CPU : AMD Phenom 9850

Chipset : Nvidia nForce 780a

RAM : 8 GB

PCI-E : 2x PCI-E 2.0 x16 and 2x PCI-E 2.0 x8

eyal

That’s interesting. How long do your runs last? Is it safe to run code for a long time - like several days - on GTXs (in terms of the GPUs not burning out)? In our experience it’s been a bit problematic, and therefore we prefer Teslas, which seem quite stable.

It’s been running at least a week straight without any problem - we’re still not in production,

so from time to time I manually stop the application.

As far as I can see the GTX 295 has performed well; I still don’t have months of 24x7 experience though :)

What’s your experience with that? The Teslas are much more expensive compared to the GTX 295…

eyal

If it’s a NUMA system, you might want to run your benchmarks with each thread tasksetted to a core nearest the GPU device it drives. We’ve seen as much as a 25% difference in host-to-device bandwidth this way. That’s the reason we created a wrapper library to do it for us.
We just recently posted it to SourceForge:
https://sourceforge.net/projects/cudawrapper/
The hope is that NVIDIA ends up noticing the practicality of the features we incorporate in the wrapper and pulls them in themselves.

Jeremy
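For a quick experiment along these lines without a wrapper library, a process can also pin itself to a core from inside the program. The sketch below uses `os.sched_setaffinity`, which is Linux-only; picking the lowest-numbered allowed core is purely for illustration, since which core is actually nearest a given GPU depends on the board topology:

```python
# Pin the current process to a single CPU core on Linux, similar in
# spirit to launching the benchmark under `taskset`. Linux-only:
# os.sched_setaffinity is not available on all platforms.
import os

allowed = os.sched_getaffinity(0)   # cores we are currently allowed on
target = min(allowed)               # arbitrary choice for this demo

os.sched_setaffinity(0, {target})   # pin to that single core
assert os.sched_getaffinity(0) == {target}

os.sched_setaffinity(0, allowed)    # restore the original mask
```

In a real multi-GPU harness you would run this in each worker thread before creating its CUDA context, choosing the core set for the NUMA node closest to that thread's GPU (that ordering and mapping is my assumption about how such a harness would be wired, not something from the posts above).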