Hi,
I wanted to share some results of my recent tests with multi-GPU systems. These results are, of course, specific to our algorithm and might differ for other people.
I’ve tested two multi-gpu machines (all linux machines):
- Intel machine: 2x quad-core (8 CPU cores total), 3x GTX295 (6 GPUs total).
- AMD machine: 1x quad-core (4 CPU cores total), 4x GTX295 (8 GPUs total).
Below are the results I've measured (per-input times in minutes:seconds, totals in hh:mm:ss):
gpu-ws5 (CUDA 2.2)                                            Factors (1 GPU vs N)
Input #    1 GPU     2 GPU     4 GPU     6 GPU     8 GPU    vs 2   vs 4   vs 6   vs 8
1        08:35:56  04:22:00  02:15:50  01:40:00  01:21:00   1.97   3.80   5.16   6.37
2        08:33:10  04:19:20  02:13:00  01:37:00  01:17:00   1.98   3.86   5.29   6.66
3        08:33:13  04:19:15  02:12:00  01:36:00  01:17:00   1.98   3.89   5.35   6.67
4        08:33:55  04:19:11  02:13:01  01:37:00  01:17:00   1.98   3.86   5.30   6.67
5        08:33:22  04:19:14  02:13:00  01:37:00  01:17:00   1.98   3.86   5.29   6.67
6        08:33:50  04:19:10  02:13:00  01:37:00  01:16:00   1.98   3.86   5.30   6.76
7        08:33:27  04:19:10  02:13:00  01:37:00  01:17:00   1.98   3.86   5.29   6.67
8        08:33:22  04:19:10  02:13:00  01:36:00  01:17:00   1.98   3.86   5.35   6.67
9        08:33:47  04:19:10  02:13:00  01:36:00  01:17:00   1.98   3.86   5.35   6.67
Total    01:17:02  00:38:55  00:19:58  00:14:32  00:11:35   1.98   3.86   5.30   6.65
gpu-ws4 (CUDA 2.2)
Input #    1 GPU     2 GPU     4 GPU     6 GPU
1        08:35:00            02:15:50  01:32:55
2        08:32:10            02:12:00  01:29:50
3
Total    01:38:17  00:38:55  00:19:57  00:13:33
Some observations and analysis of the results:
- As you can see, the results from the two machines were pretty much the same.
- On the AMD machine, the fact that there were more GPUs than CPU cores didn't make any difference.
- Tests were done on CUDA 2.2. I saw a ~10-15% performance boost when moving from 2.0 to 2.2.
- 2 GPUs (one GTX295) run almost twice as fast as 1 GPU (one half of a GTX295).
- 4 GPUs (2 GTX295s) run x3.86 faster than 1 GPU (one half of a GTX295).
- 6 GPUs (3 GTX295s) run x5.3 faster than 1 GPU.
- 8 GPUs (4 GTX295s) run ~x6.66 faster than 1 GPU.
- I think the reason I don't see a fully linear improvement is due to issues not related to the GPUs, but rather to my CPU thread-handling code and the overhead of scheduling short tasks (hi - 8 GPUs run so fast that a task completes in only 1 minute!! :) ).
- I had a big problem in the previous version of the code, where 8 GPUs ran only ~x4-4.5 faster than 1 GPU, and 4 GPUs ran only ~x2.5-3 faster than 1 GPU. After redesigning (and, more importantly, identifying the bottleneck - surprise, surprise - the PCI bus), I moved some serial code from the CPU to the GPU. This change meant copying only ~500 KB of data from device to host (the result of the algorithm) instead of a couple of MBs. This opened up the bottleneck in a very nice way... :)
Hope this helps people :)
All in all - CUDA, GPUs and NVIDIA rock!!!
Here are the full details of the systems:
GPU-ws4:
CPU : 2x Intel Xeon 5420
Chipset : Intel D5400XS
RAM : 16 GB
PCI-E : 4x PCI-E 1.1 x16 (we can use only 3 slots due to physical board limitations)
GPU-ws5:
CPU : AMD Phenom 9850
Chipset : Nvidia nForce 780a
RAM : 8 GB
PCI-E : 2x PCI-E 2.0 x16 and 2x PCI-E 2.0 x8
eyal