HyperQ and MPI

Guix · February 8, 2013, 9:45am

Hi everyone,

I’m trying to write a small CUDA sample that shows the HyperQ improvement when the GPU is accessed by several MPI processes. My case is really basic: each MPI process launches a single kernel on my Tesla K20. The kernel does not use all of the GPU’s capabilities (occupancy is around 6%), so in theory some of the kernels should execute concurrently. It seems easy, but after many tries I still cannot get the expected behavior: all the kernels are always executed serially…

My questions:

Maybe (or surely :)) I’m forgetting something in my implementation… Is there a special trick to activate HyperQ on the GK110 architecture?
Does someone have a simple sample that shows how to use the HyperQ feature with MPI?

My configuration:

Ubuntu 12.04
Tesla K20
Latest CUDA driver & toolkit
Open MPI 1.4.3

Thanks for your help !
Guix

Gert-Jan · February 8, 2013, 10:56am

Occupancy only indicates how many threads are running on the GPU compared to the theoretical maximum. If you have one thread per threadblock, but for example allocate all available shared memory on an SMX to that thread, you can run only 13 threads on the 13 SMXes of your Tesla K20.
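The arithmetic behind this point can be sketched in a few lines of host-side C++. This is only an illustration, not an NVIDIA API: the struct `SmxLimits` and the functions `residentBlocksPerSmx` and `occupancy` are hypothetical names, registers are ignored, and the limits used in the note below (48 KB of shared memory, 16 resident blocks and 2048 resident threads per SMX) are assumed GK110 figures that should be checked against the device properties of the actual card.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-SMX limits; fill them in from your own device properties.
struct SmxLimits {
    std::size_t sharedMemBytes;  // shared memory available per SMX
    int maxResidentBlocks;       // hardware cap on resident blocks per SMX
    int maxResidentThreads;      // hardware cap on resident threads per SMX
};

// Number of blocks that fit on one SMX, limited by the block cap,
// the thread cap and shared memory (register usage is ignored here).
int residentBlocksPerSmx(const SmxLimits& lim, int threadsPerBlock,
                         std::size_t sharedBytesPerBlock) {
    assert(threadsPerBlock > 0);
    int byThreads = lim.maxResidentThreads / threadsPerBlock;
    int byShared = sharedBytesPerBlock == 0
        ? lim.maxResidentBlocks
        : static_cast<int>(lim.sharedMemBytes / sharedBytesPerBlock);
    return std::min({lim.maxResidentBlocks, byThreads, byShared});
}

// Fraction of the thread slots of one SMX that are in use.
double occupancy(const SmxLimits& lim, int threadsPerBlock,
                 std::size_t sharedBytesPerBlock) {
    int blocks = residentBlocksPerSmx(lim, threadsPerBlock, sharedBytesPerBlock);
    return static_cast<double>(blocks) * threadsPerBlock / lim.maxResidentThreads;
}
```

With the assumed limits `SmxLimits{48 * 1024, 16, 2048}`, a kernel with one thread per block that requests 48 KB of shared memory gets one resident block per SMX, so 13 threads in total on 13 SMXes. Its occupancy is tiny, yet no second kernel that needs shared memory could run next to it, which is why low occupancy alone does not guarantee concurrency.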

If this isn’t your bottleneck, check the output of nvidia-smi -q, and make sure the “Compute Mode” of the K20 is set correctly. I guess that in your case the “0/DEFAULT” option is the best choice.

eyalhir74 · February 8, 2013, 11:54am

The SDK should have a test case for the HyperQ feature.

eyal

DrAnderson42 · February 8, 2013, 1:17pm

Note that HyperQ by default only works for kernels launched from different streams in the same process. There is a tool (called proxy) that enables multiple MPI ranks on a node to run kernels in parallel on the same GPU, but documentation is exceedingly sparse. The only place I’ve seen it mentioned is in the GTC talk “S0351 - Strong Scaling for Molecular Dynamics Applications”. CUDA 5.0 comes with the executables “nvidia-cuda-proxy-control” and “nvidia-cuda-proxy-server”, which don’t even have --help options.

Gert-Jan · February 8, 2013, 1:46pm

nvidia-cuda-proxy-something sounds really interesting. Even though it does not have a --help option, you can run man nvidia-cuda-proxy-control; the options and some documentation are described there.

Guix · February 8, 2013, 3:48pm

OK, thank you for your answers.

@Gert-Jan: You are right about occupancy, I said that to keep it short. My kernel uses only a few registers and does not allocate shared memory (the profile is in the attachment). The Compute Mode is set to “default”, and it does seem to be the best choice in my case.

@eyalhir74: There is a HyperQ sample provided with CUDA 5.0, but it shows how to launch many kernels in different streams in the same process. In my case I have several MPI processes.

@DrAnderson42: What?? Access from multiple MPI processes does not work by default? Thank you for this information, I did not know. Where did you find it? It should be written in bold in the GK110 whitepaper… The documentation is indeed exceedingly sparse, which is a pity.

@Everyone: I will try with “nvidia-cuda-proxy-something”. I’ll let you know !

Thanks,
Guix

Guix · February 11, 2013, 2:34pm

Hi everyone,

I tried to use the “nvidia-cuda-proxy-control” and “nvidia-cuda-proxy-server” executables to run my CUDA/MPI application, without success…
I can run “nvidia-cuda-proxy-control” and launch the proxy control daemon, but then I am lost. If I try to launch my CUDA/MPI application I get an error message: all cuda-capable devices are busy or unavailable.
Surely I must use “nvidia-cuda-proxy-server” too, but I do not know how it works or what it does because there is no documentation about it. There is only the man page of “nvidia-cuda-proxy-control”, which is really short.

Has anyone ever used “nvidia-cuda-proxy-something”, or does anyone have VIP documentation that could help me?

Thanks in advance,
Guix

DrAnderson42 · February 12, 2013, 7:10pm

@Gert-Jan: Aha! I checked for man pages, but was running on a system that didn’t have them installed for some reason. Now I see them.

@Guix: I learned about it in that GTC talk I mentioned. You might want to watch the video recording of the talk to see for yourself. But don’t expect too much on proxy: it is discussed only briefly and without details. I only know that it exists, and I have never tried to make it work. It seems like it’s a very beta feature and not intended for mass consumption yet.

The man page mentions log files, did you check those? Maybe there is something there to help you.

ukapasi · February 20, 2013, 1:03am

Hi Guys,

Sorry for the confusion on this. I agree the documentation and online help for HyperQ-related features could be much better.

HyperQ refers to two related capabilities of the Tesla K20 and later GPUs:

1. concurrency, when possible, for kernels launched into different streams in the same process
2. concurrency, when possible, between kernels launched from different MPI ranks in different processes running

In the CUDA 5.0 release:

#1 is supported and documented (http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#hyperq). There is also sample code in the simpleHyperQ example here: http://docs.nvidia.com/cuda/cuda-samples/index.html#advanced
#2 is supported on a few Cray-based systems (e.g. Titan) in the CUDA 5.0 release. We’re working on productizing (testing, documentation, etc.) this feature for a wider range of hardware/software configurations in an upcoming release.

I hope this is helpful. Drop me a PM if you are interested in trying this feature in a pre-release build and providing feedback based on your experience.

Thanks,
Ujval Kapasi
NVIDIA

jclee · June 25, 2013, 7:27pm

It has been 6 months since this thread was last active. Hopefully things have changed a little.

I have machines with 16 cores and 4 Kepler cards each, running Red Hat Linux. I am trying to test my code on this setup before moving to Titan.

Is there a way I can run the CUDA proxy on all nodes and then have 4 CPU cores share one GPU card, so that the 4 CUDA launches run concurrently on every GPU card?

This way I can use all 16 cores and all 4 cards on each node.

My code is set up this way: each MPI process makes a single CUDA call. I have already tested with a single CPU core and verified that one GPU card can handle the number of GPU threads issued by four MPI processes.

I have the MPI routines (from Oak Ridge) that identify the device IDs for each MPI process. Do I just launch the proxy and have each MPI process call its intended device?

My second question is: if I have the proxy running, can a simple serial code (not MPI) call the proxy and run code on a specified card?
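For the 16-core, 4-GPU layout described above, the rank-to-device step can be sketched as a small host-side C++ helper. This is a minimal sketch under stated assumptions, not the Oak Ridge routines mentioned in the post: `deviceForLocalRank` is a hypothetical name, the node-local rank is assumed to come from the MPI launcher (for example an environment variable such as Open MPI's `OMPI_COMM_WORLD_LOCAL_RANK`), and the returned index would then be passed to `cudaSetDevice` by each process. The helper only decides which card a rank targets; whether kernels from different ranks actually overlap still depends on the proxy.

```cpp
#include <cassert>

// Hypothetical helper: choose a GPU index for an MPI rank from its
// node-local rank, using a block mapping (consecutive ranks share a GPU).
int deviceForLocalRank(int localRank, int ranksPerNode, int gpusPerNode) {
    assert(ranksPerNode > 0 && gpusPerNode > 0);
    assert(localRank >= 0 && localRank < ranksPerNode);
    // Ranks per GPU, rounded up so every rank maps to a valid device index.
    int ranksPerGpu = (ranksPerNode + gpusPerNode - 1) / gpusPerNode;
    return localRank / ranksPerGpu;
}
```

With 16 ranks and 4 GPUs per node, local ranks 0 to 3 map to device 0, ranks 4 to 7 to device 1, and so on, which gives the four-processes-per-card arrangement asked about.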

mfatica · July 17, 2013, 3:18pm

I wrote some detailed instructions to enable CUDA MPS (formerly known as CUDA proxy) on a machine with multiple GPUs.
It is an unsupported configuration, but it works. Details at CUDA Musing: Enabling CUDA Multi Process Service (MPS) with multiple GPUs.

pasoleatis · July 17, 2013, 8:29pm

Hello,

I hope it is OK to jump into this topic. I thought HyperQ meant that kernels from different processes (MPI ranks, OpenMP threads, or simply 2 programs running on the same card) would run concurrently. Is this wrong?

According to this page on the NVIDIA Blog, the kernels from different MPI processes would be executed concurrently.

pasoleatis · October 11, 2013, 12:00pm

Sorry to resurrect this topic, but I have one question about HyperQ that I have not been able to find the answer to.
Suppose I have a program with no communication with the CPU that does not use 100% of the card. If I run 2 such programs at the same time, will it take less time than running them one after the other? On cards without HyperQ the kernels from each program are just executed sequentially. Will 2 programs run concurrently on Titan?
