Hi everyone, I am considering converting a large MPI physics simulation to a GPU hybrid code using CUDA and/or OpenACC.
The code will be run on many nodes, each of which may contain, for example, 8 CPU cores and 1 GPU.
The code is written so that each CPU core runs one MPI process. However, this creates the problem of 8 MPI ranks contending for that 1 GPU. How is this usually dealt with? Can the GPU accelerate the code so much that running only 1 CPU core and 1 GPU per node would still be faster overall?
Titan, the new supercomputer at Oak Ridge, has 16 CPU cores per GPU, for example, so clearly there must be solutions to this problem.
The ideal scenario is one CPU core per GPU, and in fact that is the configuration you would want for the best value: you would like to pack as many GPUs into one machine as possible in order to minimize what you spend on the machine's hardware.
That being said, look at the Hyper-Q feature that the new high-end Kepler GPUs provide; it gives you the ability to submit work from multiple CPU cores to a single Kepler GPU.
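In practice, Hyper-Q sharing between MPI ranks is enabled through a proxy daemon (known as MPS, the Multi-Process Service, in current CUDA releases). A hedged setup sketch follows; exact commands and environment variable names may vary by installation and the `my_simulation` binary is just a placeholder:

```shell
# Assumption: a Linux node with one Kepler-class GPU and a CUDA driver
# that ships the MPS control utility.
export CUDA_VISIBLE_DEVICES=0      # expose the single GPU to the daemon
nvidia-cuda-mps-control -d         # start the MPS daemon in the background

# Run the MPI job as usual; all 8 ranks funnel through the daemon:
# mpirun -np 8 ./my_simulation

# Shut the daemon down when the job is done:
echo quit | nvidia-cuda-mps-control
```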
You could use the remaining CPU cores for something useful, like analysis, statistics, or structure factors. If 1 CPU core supplies enough work to fill the GPU, the others will have to wait anyway. Taking into account that you can get speed-ups of 100x or more, you are not going to need the remaining 7 cores; you can submit all of the GPU work from 1 process. For inter-node GPU communication, you should try to take advantage of the GPUDirect feature.
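The "one rank drives the GPU while the CPU does analysis" pattern relies on launching kernels asynchronously. A minimal sketch, assuming a single rank owns the device; the kernel and the `analyze_on_cpu` routine are hypothetical stand-ins for the poster's physics and analysis code:

```cuda
#include <cuda_runtime.h>

// Placeholder physics kernel, not the actual simulation.
__global__ void simulate_step(float *state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 0.5f * state[i];
}

// Hypothetical CPU-side analysis (statistics, structure factors, ...).
void analyze_on_cpu(const float *snapshot, int n);

void run_step(float *d_state, float *h_snapshot, int n, cudaStream_t stream) {
    // Asynchronous launch: the host thread returns immediately.
    simulate_step<<<(n + 255) / 256, 256, 0, stream>>>(d_state, n);

    // Overlap: analyze the previous step's snapshot on the CPU
    // while the GPU computes the current step.
    analyze_on_cpu(h_snapshot, n);

    // Bring the new state back for the next round of analysis.
    cudaMemcpyAsync(h_snapshot, d_state, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}
```

With this structure, the "idle" CPU time is hidden behind the kernel execution rather than wasted waiting on the GPU.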
How much of your execution time is spent running the GPU? If there is a large amount of work also happening on the CPU, sharing 1 GPU between many processes is not a terrible choice. I did this with a program that used the GPU for one stage of a processing pipeline. This stage dominated the runtime before, but once I moved it to the GPU, the rest of the processing stages became the bottleneck. As a result, the GPU was only being used about 10% of the time by one process, so I was able to run 8 copies of the program on the same GPU with no problems.
The context switch on the GPU does have some overhead, so it is not the best option for programs with very high CUDA utilization.
On a machine like Titan, the amount of FLOPS provided by the CPUs is so low compared to the GPUs that it would be quite sensible to leave most of the CPU cores idle in a program that spends nearly 100% of its time on the GPU. The extra cores are there to provide flexibility to adapt to programs with different ratios of CPU to GPU work.
I agree that the Hyper-Q feature should help you out a lot. It was added with MPI in mind. To my understanding, it allows multiple CPU processes to actually share the GPU's resources at the same time, i.e., many different kernels and streams execute concurrently.
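The per-rank code barely changes under this model. A hedged sketch of how each MPI rank might attach to a device (round-robin covers multi-GPU nodes; with one GPU, every rank simply selects device 0, and Hyper-Q/MPS interleaves their launches):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    // Round-robin mapping; all ranks land on device 0 when ndev == 1.
    cudaSetDevice(rank % ndev);

    // ... each rank launches its own kernels and streams here; with
    // Hyper-Q they can overlap on the shared GPU instead of serializing.

    MPI_Finalize();
    return 0;
}
```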