I am starting to use a K20 card and am wondering whether there are any examples of how to use the Hyper-Q feature in CUDA Fortran.

My problem: I am running simulations with multiple MCMC chains in parallel (each chain is simulated by one MPI process), and I would like the processes to access a single K20 card simultaneously.

Thanks, Jan

Hi Jan,

Hyper-Q just expands the number of streams and contexts that the device can handle. So there's nothing Hyper-Q specific in CUDA; you just need to use the existing streams construct and/or attach multiple host processes to a single device.

This article isn't Hyper-Q specific, but it does give an overview of asynchronous data movement and using streams: http://www.pgroup.com/lit/articles/insider/v3n1a4.htm

  • Mat

On the NVIDIA Dev Zone, I found this:

In the CUDA 5.0 release:

#1 is supported and documented (http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#hyperq). There is also sample code in the simpleHyperQ example here: http://docs.nvidia.com/cuda/cuda-samples/index.html#advanced

#2 is supported on a few Cray-based systems (e.g. Titan) in the CUDA 5.0 release. We're working on productizing (testing, documentation, etc.) this feature for a wider range of hardware/software configurations in an upcoming release.

Does anyone have experience with #2 and CUDA Fortran?

Thanks, Jan

"Does anyone have experience with #2 and CUDA Fortran?"

See the article I noted above. Hyper-Q lets the device make better use of streams, so there's nothing new in how you program. You may simply see better performance when using asynchronous kernels and multiple streams.

  • Mat
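
For readers looking for a concrete starting point, here is a minimal sketch of the "asynchronous kernels and multiple streams" pattern described above. It is written in CUDA C rather than CUDA Fortran; the same runtime calls (cudaStreamCreate, cudaMemcpyAsync, the stream argument in the launch configuration) are also available from CUDA Fortran. The kernel `scale`, the stream count, and the array size are made up for illustration and do not come from the linked article. Whether the work in different streams actually overlaps depends on the device and on how many resources each kernel uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies each element of x by a.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int nStreams = 4;                 // assumed number of independent work items
    const int n = 1 << 20;                  // elements handled by each stream
    const size_t bytes = n * sizeof(float);

    float *h[nStreams];
    float *d[nStreams];
    cudaStream_t s[nStreams];

    for (int k = 0; k < nStreams; ++k) {
        // Pinned host memory is needed for copies to be truly asynchronous.
        cudaMallocHost((void **)&h[k], bytes);
        cudaMalloc((void **)&d[k], bytes);
        cudaStreamCreate(&s[k]);
        for (int i = 0; i < n; ++i) h[k][i] = 1.0f;
    }

    // Queue copy-in, kernel, and copy-out on each stream without waiting.
    // Operations within one stream run in order; different streams may overlap.
    for (int k = 0; k < nStreams; ++k) {
        cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice, s[k]);
        scale<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], 2.0f, n);
        cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, s[k]);
    }

    // Wait for all streams to finish and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int k = 0; k < nStreams; ++k)
        printf("stream %d: h[0] = %f\n", k, h[k][0]);

    for (int k = 0; k < nStreams; ++k) {
        cudaStreamDestroy(s[k]);
        cudaFree(d[k]);
        cudaFreeHost(h[k]);
    }
    return 0;
}
```

Nothing in this sketch is Hyper-Q specific, which matches the point made above: the code is the same on older devices, and a profiler timeline is the usual way to check how much of the work really ran concurrently.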

Hi Mat,

The first part of what I meant to quote got lost. So here it is:

"HyperQ refers to two related capabilities of the Tesla K20 and later GPUs:

  1. concurrency, when possible, for kernels launched into different streams in the same process

  2. concurrency, when possible, between kernels launched from different MPI ranks in different processes running in parallel on the same node."

( https://devtalk.nvidia.com/default/topic/529136/hyperq-and-mpi/ )

I think you are referring to option 1. However, my understanding is that Hyper-Q can be used by multiple MPI processes without requiring any changes to existing code. Apparently this is done by running a CUDA proxy server.

Thanks, Jan

Hi Jan,

The CUDA proxy daemon is news to me, so I'd need to refer you to NVIDIA for more information. I agree with one of the commenters who was surprised that you'd need to run this. It was my understanding that you didn't need to do anything special. Note that CUDA Fortran uses the same underlying mechanisms as CUDA C, so anything that applies to CUDA C will apply to CUDA Fortran.

Back in 2011, I did write an article on Multi-GPU programming using MPI and CUDA Fortran. See: http://www.pgroup.com/lit/articles/insider/v3n3a2.htm. At the time, I wrote that "setting up more than one Context on a single device is not supported." However, this is now incorrect for Kepler, given that Hyper-Q allows multiple contexts. While this code isn't a good performance benchmark since it does so little work, you can use it to test multiple MPI processes attaching to a single device.

  • Mat
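
As a rough illustration of that kind of test (this is not the code from the 2011 article, which is CUDA Fortran), the sketch below has every MPI rank attach to a GPU and run a trivial kernel. It is CUDA C with MPI; the kernel `fill` and the `rank % nDevices` mapping are assumptions made for the example, and the mapping assumes all ranks run on one node. With a single K20, every rank ends up on device 0. How to compile and link it against MPI depends on the local installation.

```cuda
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: writes 'value' into every element.
__global__ void fill(int *x, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = value;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices == 0) {
        fprintf(stderr, "rank %d: no CUDA device found\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // Assumed mapping for a single node: with one card, all ranks share device 0.
    int dev = rank % nDevices;
    cudaSetDevice(dev);

    const int n = 1024;
    int *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(int));

    // Each rank writes its own rank number on the device and reads one value back.
    fill<<<(n + 255) / 256, 256>>>(d, rank, n);
    int first = -1;
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);

    cudaError_t err = cudaGetLastError();
    printf("rank %d of %d: device %d, read back %d (%s)\n",
           rank, size, dev, first, cudaGetErrorString(err));

    cudaFree(d);
    MPI_Finalize();
    return 0;
}
```

If the device has been put into an exclusive compute mode, only one process can create a context on it, so the other ranks would report an error here. As with the article's code, this does too little work to say anything about performance or about whether work from different ranks overlaps.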

Thanks Mat.

I also got a reply from NVIDIA which may be of interest to some. Apparently, the overlapping of MPI processes will be supported in CUDA 5.5 this summer.


Quote from Ujval Kapasi:

You can do that now even on older HW, actually. Basically, you can run a different process on each core in your node, corresponding to different MPI ranks in your application. Each process can issue work (PCI transfers and computation) to the same GPU.

However, the older hardware and software will not overlap execution of items issued by different processes. These will be handled in serial.

However, HyperQ on K20 is better because it allows the hardware to overlap execution of items from different processes on the same node, when possible. In order to access that functionality on K20, you will need CUDA 5.5, which has not been released yet.

When CUDA 5.5 is released this summer, it will contain support for this. You will need to run a special server process to enable the functionality, and hence you will need system administrator privileges on your node.



You mention that back in 2011 you wrote an article on Multi-GPU programming using MPI and CUDA Fortran. Do you have a more recent article or an example that runs with CUDA 5.5?

Thanks, Dana

Hi Dana,

Thanks for your interest, but no, sorry, I haven't updated the article recently. However, the basic information is still valid and useful with CUDA 5.5.

At the time, MPI-aware GPUDirect was just being implemented. I had planned on doing a follow-up article once GPUDirect became more mature and more GPUDirect-capable NICs were available. However, in the last few years I've been focusing on OpenACC rather than CUDA Fortran, so I never got back to it. Let me talk with some of the other application engineers who do focus on CUDA Fortran and see if they can write a follow-on article.

Is there something in particular that you’re interested in learning how to do?

  • Mat

Hi Dana,

I talked with Greg Ruetsch. He has a chapter on using MPI and CUDA Fortran, including GPUDirect, in his book CUDA Fortran for Scientists and Engineers, with source code examples that may be useful. He mentioned that the most difficult part is getting MVAPICH set up correctly, but there's a README file in the code examples which explains this.

  • Mat