Hyper-Q

Hi,

I am starting to use a K20 card and am wondering if there are any examples of how to use the Hyper-Q feature in CUDA Fortran.

My problem: I am running simulations with multiple MCMC chains in parallel (each chain is simulated by one MPI process), and I would like these processes to access a single K20 card simultaneously.

Thanks, Jan

Hi Jan,

Hyper-Q just expands the number of streams and contexts that the device can handle concurrently. So in CUDA there's nothing Hyper-Q specific; rather, you just use the existing streams construct and/or attach multiple host processes to a single device.
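
For example, here is a minimal sketch of the streams approach in CUDA Fortran (the kernel, array sizes, and stream count are made up for illustration; this isn't taken from any particular sample):

    module stream_kernels
    contains
      ! Trivial kernel: scale each element of a device array
      attributes(global) subroutine scale(a, s, n)
        real, device :: a(*)
        real, value :: s
        integer, value :: n
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) a(i) = s * a(i)
      end subroutine scale
    end module stream_kernels

    program multistream
      use cudafor
      use stream_kernels
      implicit none
      integer, parameter :: n = 1048576, nstreams = 4
      integer(kind=cuda_stream_kind) :: streams(nstreams)
      real, device :: d_a(n, nstreams)
      integer :: i, istat

      d_a = 1.0
      do i = 1, nstreams
        istat = cudaStreamCreate(streams(i))
      end do

      ! Each launch is queued in its own stream; on a K20, Hyper-Q lets
      ! the hardware run these kernels concurrently when resources allow.
      do i = 1, nstreams
        call scale<<<(n+255)/256, 256, 0, streams(i)>>>(d_a(:,i), 2.0, n)
      end do

      istat = cudaDeviceSynchronize()
      do i = 1, nstreams
        istat = cudaStreamDestroy(streams(i))
      end do
    end program multistream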

This article isn't Hyper-Q specific, but it does give an overview of asynchronous data movement and the use of streams: Account Login | PGI

  • Mat

On the Nvidia Dev Zone, I found this:

In the CUDA 5.0 release:

#1 is supported and documented (Kepler Tuning Guide :: CUDA Toolkit Documentation). There is also sample code in the simpleHyperQ example here: CUDA Samples :: CUDA Toolkit Documentation

#2 is supported on a few Cray-based systems (e.g. Titan) in the CUDA 5.0 release. We're working on productizing (testing, documentation, etc.) this feature for a wider range of hardware/software configurations in an upcoming release.

Does anyone have experience with #2 and CUDA Fortran?

Thanks, Jan

Does anyone have experience with #2 and CUDA Fortran?

See the article I noted above. Hyper-Q helps to better utilize streams, so there's nothing new in how you program; you may just see better performance by using asynchronous kernels and multiple streams.
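
As a rough illustration of that pattern (a sketch only, with made-up sizes; it assumes pinned host memory and the element-count form of cudaMemcpyAsync, and is not code from the article):

    module addone_m
    contains
      ! Trivial kernel operating on one chunk of the array, starting at off
      attributes(global) subroutine addone(a, off)
        real, device :: a(*)
        integer, value :: off
        integer :: i
        i = off + (blockIdx%x - 1) * blockDim%x + threadIdx%x
        a(i) = a(i) + 1.0
      end subroutine addone
    end module addone_m

    program asyncpipe
      use cudafor
      use addone_m
      implicit none
      integer, parameter :: nchunks = 4, m = 262144, n = nchunks*m
      real, pinned, allocatable :: h(:)   ! pinned host memory is needed for truly asynchronous copies
      real, device :: d(n)
      integer(kind=cuda_stream_kind) :: s(nchunks)
      integer :: i, off, istat

      allocate(h(n))
      h = 0.0
      do i = 1, nchunks
        istat = cudaStreamCreate(s(i))
      end do

      ! Copy-in, kernel, and copy-out for chunk i are queued in stream i,
      ! so transfers and kernels of different chunks can overlap.
      do i = 1, nchunks
        off = (i-1)*m
        istat = cudaMemcpyAsync(d(off+1), h(off+1), m, s(i))
        call addone<<<m/256, 256, 0, s(i)>>>(d, off)
        istat = cudaMemcpyAsync(h(off+1), d(off+1), m, s(i))
      end do
      istat = cudaDeviceSynchronize()
    end program asyncpipe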

  • Mat

Hi Mat,

the first part of what I meant to quote got lost. So here it is:

"HyperQ refers to two related capabilities of the Tesla K20 and later GPUs:

  1. concurrency, when possible, for kernels launched into different streams in the same process

  2. concurrency, when possible, between kernels launched from different MPI ranks in different processes running in parallel on the same node."

( HyperQ and MPI - CUDA Programming and Performance - NVIDIA Developer Forums )

I think you are referring to option 1. However, my understanding is that Hyper-Q can also be utilized by multiple MPI processes without requiring any changes to existing code. Apparently this is done by running a CUDA proxy server.

Thanks, Jan

Hi Jan,

The CUDA proxy daemon is news to me, so I'd need to refer you to NVIDIA for more information. I'm with one of the commenters who was surprised you'd need to run this; my understanding was that you didn't need to do anything special. Note that CUDA Fortran uses the same underlying mechanisms as CUDA C, so anything that applies to CUDA C also applies to CUDA Fortran.

Back in 2011, I wrote an article on Multi-GPU programming using MPI and CUDA Fortran. See: Account Login | PGI. At the time I wrote the statement "setting up more than one Context on a single device is not supported." However, this is now incorrect for Kepler, since Hyper-Q allows multiple contexts. While this code isn't a good performance benchmark since it does so little work, you can use it to test multiple MPI processes attaching to a single device.
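
Here's a stripped-down sketch of that kind of test (made-up array size, no real work) where every MPI rank attaches to the same device:

    program mpi_shared_gpu
      use cudafor
      use mpi
      implicit none
      integer :: rank, nprocs, ierr, istat
      real, device, allocatable :: d_a(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Every rank attaches to device 0 and gets its own context.
      ! Pre-Kepler this work is serialized; on a K20 with Hyper-Q it can overlap.
      istat = cudaSetDevice(0)
      allocate(d_a(1024*1024))
      d_a = real(rank)                ! scalar-to-device assignment runs on the GPU
      istat = cudaDeviceSynchronize()
      print *, 'rank', rank, 'of', nprocs, 'finished on device 0'

      deallocate(d_a)
      call MPI_Finalize(ierr)
    end program mpi_shared_gpu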

  • Mat

Thanks Mat.

I also got a reply from NVIDIA which may be of interest to some. Apparently, overlapping work from multiple MPI processes will be supported in CUDA 5.5 this summer.

-Jan

Quote from Ujval Kapasi:

You can do that now even on older HW, actually. Basically, you can run a different process on each core in your node, corresponding to different MPI ranks in your application. Each process can issue work (PCI transfers and computation) to the same GPU.

However, older hardware and software will not overlap execution of items issued by different processes; they will be handled serially.

However, HyperQ on K20 is better because it allows the hardware to overlap execution of items from different processes on the same node, when possible. In order to access that functionality on K20, you will need CUDA 5.5, which has not been released yet.

When CUDA 5.5 is released this summer, it will contain support for this. You will need to run a special server process to enable the functionality, and hence you will need system administrator privileges on your node.

Ujval

Mat,

You mention that back in 2011 you wrote an article on Multi-GPU programming using MPI and CUDA Fortran. Do you have a more recent article or an example that runs with CUDA 5.5?

Thanks, dana

Hi Dana,

Thanks for your interest, though no, sorry, I haven't updated the article recently. However, the basic information is still valid and useful with CUDA 5.5.

At the time, MPI-aware GPUDirect was just being implemented. I had planned on doing a follow-up article once GPUDirect became more mature and more GPUDirect-capable NIC cards were available. However, in the last few years I've been focusing on OpenACC rather than CUDA Fortran, so I never got back to it. Let me talk with some of the other application engineers who do focus on CUDA Fortran and see if they can write a follow-on article.

Is there something in particular that you’re interested in learning how to do?

  • Mat

Hi Dana,

I talked with Greg Ruetsch. He has a chapter on using MPI and CUDA Fortran, including GPUDirect, in his book CUDA Fortran for Scientists and Engineers, along with source code examples that may be useful. He mentioned that the most difficult part is getting MVAPICH set up correctly, but there's a README file in the code examples which explains this.

  • Mat