Using GPUs on high performance machines

brushman · February 6, 2013, 5:14am

Hi everyone, I am considering converting a large MPI physics simulation to a GPU hybrid code using CUDA and/or OpenACC.

The code will be run on many nodes, each of which may contain, for example, 8 CPU cores and 1 GPU.
The code is written so that each CPU core executes an MPI process. However, this raises the problem of those 8 CPU cores fighting over that 1 GPU. How is this dealt with? Can the GPU accelerate the code so much that, if I only ran 1 CPU core and 1 GPU per node, my code would still run faster?

Titan, the new supercomputer at Oak Ridge, for example, has 16 CPU cores per GPU. So clearly there must be some solution to this problem.

eyalhir74 · February 6, 2013, 9:39am

Hi,
The best scenario should be one CPU core per GPU, and in fact you’d want that configuration
to get the most for your money. You’d want to squeeze as many GPUs as possible into one machine
so that you minimize the money spent on the rest of the machine’s hardware.

That being said, look into the Hyper-Q feature of the new high-end Kepler GPUs; it should
let you submit tasks from multiple CPU cores to one Kepler GPU.

eyal

pasoleatis · February 6, 2013, 11:05am

Hello,

You could use the rest of the CPU cores for something useful, like analysis, statistics, structure factors… If 1 CPU core gives the GPU enough work to fill it, the others will have to wait anyway. Taking into account that you can get speed-ups of up to 100x or more, you are not going to need the other 7. You can submit the whole workload from 1 process. For inter-node GPU communication, you should try to take advantage of the GPUDirect feature.
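To put rough numbers on the "1 core + 1 GPU versus 8 MPI processes" question, here is a minimal back-of-the-envelope sketch. It assumes the CPU-only code scales perfectly across cores (optimistic for MPI) and that the GPU speed-up is measured against a single core. The function names and the inputs `gpu_fraction` and `gpu_speedup` are hypothetical placeholders, not measurements from any real code.

```cpp
#include <cassert>

// Relative runtime of one CPU core plus one GPU, with the single-core,
// CPU-only runtime normalized to 1.
// gpu_fraction: share of that runtime that can be offloaded (assumed input).
// gpu_speedup:  how much faster the GPU runs that share than one core does
//               (assumed input).
double hybrid_time(double gpu_fraction, double gpu_speedup) {
    assert(gpu_fraction >= 0.0 && gpu_fraction <= 1.0);
    assert(gpu_speedup > 0.0);
    return (1.0 - gpu_fraction) + gpu_fraction / gpu_speedup;
}

// Relative runtime of the CPU-only MPI run, assuming perfect scaling
// across `cores` processes.
double mpi_time(int cores) {
    assert(cores > 0);
    return 1.0 / cores;
}

// True when 1 core + 1 GPU beats `cores` CPU-only MPI processes.
bool hybrid_wins(double gpu_fraction, double gpu_speedup, int cores) {
    return hybrid_time(gpu_fraction, gpu_speedup) < mpi_time(cores);
}
```

Under these assumptions, with a 100x speed-up and 8 cores, the single-process hybrid run only wins if more than roughly 88% of the original runtime can be moved to the GPU; for instance, `hybrid_wins(0.95, 100.0, 8)` is true while `hybrid_wins(0.80, 100.0, 8)` is false. The part of the code that stays on the CPU quickly becomes the limit.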

seibert · February 6, 2013, 2:14pm

How much of your execution time is spent running the GPU? If there is a large amount of work also happening on the CPU, sharing 1 GPU between many processes is not a terrible choice. I did this with a program that used the GPU for one stage of a processing pipeline. This stage dominated the runtime before, but once I moved it to the GPU, the rest of the processing stages became the bottleneck. As a result, the GPU was only being used about 10% of the time by one process, so I was able to run 8 copies of the program on the same GPU with no problems.
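The arithmetic behind that example can be sketched as follows. This is a simplified estimate that assumes identical processes and ignores context-switch overhead; the function names are made up for illustration.

```cpp
#include <cassert>

// Combined GPU demand when `processes` identical processes share one GPU,
// each keeping it busy for `busy_fraction` of its own runtime.
// Context-switch overhead is ignored in this estimate.
double aggregate_gpu_demand(int processes, double busy_fraction) {
    assert(processes >= 0);
    assert(busy_fraction >= 0.0 && busy_fraction <= 1.0);
    return processes * busy_fraction;
}

// True when the combined demand stays at or below 100% of one GPU,
// i.e. the processes are not expected to queue up behind each other.
bool fits_on_one_gpu(int processes, double busy_fraction) {
    return aggregate_gpu_demand(processes, busy_fraction) <= 1.0;
}
```

With 8 processes at about 10% GPU use each, the combined demand is about 80%, so `fits_on_one_gpu(8, 0.10)` is true. At 90% use per process the same call with `0.90` is false, and the processes would mostly wait on each other.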

The context switch on the GPU does have some overhead, so it is not the best option for programs with very high CUDA utilization.

On a machine like Titan, the amount of FLOPS provided by the CPUs is so low compared to the GPUs that it would be quite sensible to leave most of the CPU cores idle in a program that spends nearly 100% of its time on the GPU. The extra cores are there to provide flexibility to adapt to programs with different ratios of CPU to GPU work.

Jimmy_Pettersson · February 8, 2013, 9:52am

I agree that the Hyper-Q feature should help you out a lot. It was added with MPI in mind. To my understanding, it allows multiple CPU processes to actually share the GPU resources at the same time, i.e., many different kernels and streams execute concurrently.
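For completeness, a common way to decide which GPU each MPI process uses is to map the node-local rank onto the available devices. The helper below is a hypothetical sketch of that mapping only; it does not call MPI or CUDA itself, and the comments note where the real calls would usually go.

```cpp
#include <cassert>

// Pick a GPU index for a process from its node-local rank.
// In an MPI + CUDA program, the local rank would typically come from
// MPI_Comm_rank on a communicator created with
// MPI_Comm_split_type(..., MPI_COMM_TYPE_SHARED, ...), the device count
// from cudaGetDeviceCount, and the result would be passed to cudaSetDevice.
// With one GPU per node, every local rank maps to device 0, so all of the
// node's processes share that GPU.
int device_for_local_rank(int local_rank, int gpus_per_node) {
    assert(local_rank >= 0);
    assert(gpus_per_node > 0);
    return local_rank % gpus_per_node;
}
```

On a node with 8 processes and 1 GPU, every process gets device 0; with 2 GPUs, the local ranks alternate between devices 0 and 1. Whether kernels from different processes actually run concurrently on the shared device then depends on the hardware and driver support discussed above.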
