Hyper-Q technology

Hi, I am experimenting with NVIDIA's Hyper-Q technology. It comes at a low coding cost. I have a GTX Titan, which has the GK110 processor, so I can experiment with this technology. It's interesting to see kernels starting at the same time in the Visual Profiler timeline in my experiments. I must say the Visual Profiler is an amazing tool, simply amazing.

What I would like to ask everybody who knows:

  1. Is there a special directive for Hyper-Q that I need to use and do not know about? From what I have seen, there shouldn’t be.
  2. Is there any scientific paper/document that takes advantage of Hyper-Q? Surely there will be one from me if I manage to make what I want happen. Research that takes advantage of new technologies is risky, so I really do not know. In any case, the results will appear in my Master’s thesis, good or bad.

I would like to add two interesting figures to show that HyperQ actually works with 8 streams on the GTX Titan. 8 streams, no more, but perfect for my experiment.

Timeline on K2000M (Kepler < SM 3.5) No HyperQ

Timeline on GTX-Titan (Kepler = SM 3.5) With HyperQ

So this encouraging image automatically raises the question: can you do distributed computing with streams? This would mean zero communication cost. At what gain, if any?

Hyper-Q refers to some architectural and HW changes that occur in cc3.5 and newer devices.
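Since the feature is tied to the hardware generation rather than to a directive, one way to see whether a given card qualifies is to query its compute capability at run time. The following is only a minimal sketch: the helper name `deviceHasHyperQ` is made up for illustration, and it assumes that "cc 3.5 or newer" is a sufficient test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the device reports compute
// capability 3.5 or newer, false otherwise or if the query fails.
bool deviceHasHyperQ(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;  // invalid device index, no GPU, or no driver
    }
    std::printf("Device %d: %s, compute capability %d.%d, concurrentKernels=%d\n",
                device, prop.name, prop.major, prop.minor, prop.concurrentKernels);
    // Assumption: cc >= 3.5 is used here as the only criterion.
    return prop.major > 3 || (prop.major == 3 && prop.minor >= 5);
}
```

Calling `deviceHasHyperQ(0)` from `main` before setting up the streams would be enough to tell a K2000M-class device apart from a GK110-class one.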

At a pure CUDA programming level, the benefits are realized in “easier” concurrency, by helping to eliminate false dependencies. To understand this better, I would refer to the simple HyperQ programming example including the PDF file:

Some information about Hyper-Q is also in the Kepler tuning guide:


Another capability supported by Hyper-Q is the ability to utilize the GPU more effectively by running concurrent kernels. This is just an extension of the above, of course, but to facilitate it there is a special utility called CUDA MPS, whose primary purpose is to enable multiple MPI ranks to efficiently utilize a single GPU (enabled by Hyper-Q). CUDA MPS is documented here:

And there is a description of how it can be used for benefit in an MPI application here:


As you can see in the example figures, there are no dependencies between the kernels and they run concurrently. So I have checked this. There are other major issues where I need to see what is really happening, in particular whether a stream is being stalled by another stream while they execute concurrently. The immediate effect would be that they run at the same time, yes, but with a longer actual execution time. I would really like to know if there is a paper/document fully analyzing this case.
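For anyone who wants to reproduce this kind of timeline, a rough sketch of the launch pattern is given here. It is not the code from the experiment: `busyKernel` and `launchIndependentKernels` are hypothetical names, the workload is a dummy loop, and the block and grid sizes are assumptions chosen to keep each kernel small.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in workload: each thread repeatedly updates one element.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Launches one independent kernel per stream, each on its own buffer, so no
// kernel depends on the result of another. Returns the first error observed.
cudaError_t launchIndependentKernels(int nStreams, int n, int iters)
{
    std::vector<cudaStream_t> streams(nStreams, nullptr);
    std::vector<float *> buffers(nStreams, nullptr);

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemsetAsync(buffers[s], 0, n * sizeof(float), streams[s]);
    }

    const int threads = 128;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // small grid per kernel
    for (int s = 0; s < nStreams; ++s) {
        busyKernel<<<blocks, threads, 0, streams[s]>>>(buffers[s], n, iters);
    }

    // Pick up launch errors first, then wait for all streams to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return err;
}
```

Running something like `launchIndependentKernels(8, 1024, 100000)` under the profiler should show whether the eight kernels overlap on a given card; whether they really do depends on the device and on how much of the GPU each kernel occupies.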

Also about MPI: first of all, it will not be useful in my case since I am bound by the communication cost, and I really do not know whether this technology has the option to use only device buffers to communicate, or whether they pass through the PCI-Express port. With streams you do not have this problem. I want to give the project a hybrid feel of distributed and shared-memory computing. Alas, a full analysis needs to be made. Also, the GTX Titan (at least the first one, not the Black) has the MPI concurrency feature locked; I do not know about the Black edition.

You can increase the number of streams available for Hyper-Q by setting the environment variable

‘CUDA_DEVICE_MAX_CONNECTIONS’ to a number >8 (32 I believe is the limit). 8 is the default.
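The variable can be exported in the shell before starting the program, or set from host code. A minimal sketch of the second option follows; `requestMaxConnections` is a hypothetical helper, it relies on POSIX `setenv`, and it assumes the variable is only picked up if it is set before the first CUDA call that initializes the device.

```cuda
#include <cstdlib>
#include <string>

// Hypothetical helper: sets CUDA_DEVICE_MAX_CONNECTIONS for this process.
// Assumption: must be called before the first CUDA runtime call, because
// the value is read when the CUDA context is created.
// Uses POSIX setenv; Windows would need a different call.
bool requestMaxConnections(int connections)
{
    if (connections < 1 || connections > 32) {
        return false;  // outside the 1..32 range mentioned above
    }
    const std::string value = std::to_string(connections);
    return setenv("CUDA_DEVICE_MAX_CONNECTIONS", value.c_str(), 1) == 0;
}
```

For example, calling `requestMaxConnections(32)` at the very top of `main` would ask for the maximum; passing a lower number than the default works the same way.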

Keep in mind the kernels launched for Hyper-Q need to be of modest to moderate size, otherwise it may end up serializing the streams.

One of my older MATLAB-callable projects has some code examples:


Lines 159-170 are an example of using Hyper-Q with cuBLAS.
Lines 274-281 are an example of using Hyper-Q with cuSPARSE.

Oleg, serialization is one of the things I am also afraid of. Thanks for bringing this up.

Oleg (or whoever knows): if I want to turn off HyperQ, do I need to set the environment variable to one, or can I simply compile with SM < 3.5?

Each independent kernel which could be launched in parallel with Hyper-Q needs a unique stream id, so all you would have to do in order to deliberately launch in serial is give each kernel the same stream id (or, if you want a fixed amount of concurrency, give each member of that group a stream id from a limited set of streams).
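A rough sketch of that idea: kernel number k goes to stream k modulo the number of streams, so one stream serializes everything and more streams allow more overlap. The names `spinKernel` and `timeKernels` are hypothetical, the workload is a dummy loop, and the timing uses a host clock around a device synchronize, which includes launch overhead and is only a coarse substitute for the profiler timeline.

```cuda
#include <chrono>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical workload: one small block that spins on its own slice of data.
__global__ void spinKernel(float *data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

// Launches nKernels one-block kernels, assigning kernel k to stream
// k % nStreams. nStreams == 1 forces serial execution. Returns the elapsed
// wall-clock time in milliseconds, or a negative value on error.
float timeKernels(int nKernels, int nStreams, int iters)
{
    if (nKernels < 1 || nStreams < 1) return -1.0f;
    const int threads = 64;  // assumed block size, kept small on purpose
    const size_t bytes = static_cast<size_t>(nKernels) * threads * sizeof(float);

    float *data = nullptr;
    if (cudaMalloc(&data, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(data, 0, bytes);

    std::vector<cudaStream_t> streams(nStreams, nullptr);
    for (auto &s : streams) cudaStreamCreate(&s);
    cudaDeviceSynchronize();  // make sure setup work is not timed

    const auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < nKernels; ++k) {
        spinKernel<<<1, threads, 0, streams[k % nStreams]>>>(data + k * threads, iters);
    }
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    const auto t1 = std::chrono::steady_clock::now();

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(data);

    if (err != cudaSuccess) return -1.0f;
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}
```

Comparing `timeKernels(8, 1, iters)` with `timeKernels(8, 8, iters)` for the same `iters` gives a first impression of the gain; intermediate values such as `timeKernels(8, 2, iters)` correspond to the "limited set of streams" case.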

I would guess that you could also set CUDA_DEVICE_MAX_CONNECTIONS to a lower value than default as well, but since you can control the degree of Hyper-Q streams in other ways that may be an inconvenient approach.

You are right about the ID. This way I can check several issues. I could also put a dummy kernel that does nothing in the loop to avoid concurrent execution of streams with the same ID, or use a synchronization barrier, or issue a stream synchronization; there are lots of options to play with. This way I can check serialization and various other issues. Let's see the timeline then… Thanks, I will do it this week.

Hyper-Q technology has not received a lot of attention in published papers. It comes for free and is very interesting with streams because they use device memory. So thorough research needs to be done. I will do it on the simple example of this project. I did the distributed version in no time, but I want to make it ideal, and this needs time…

Edit (7/29/2014): Just for the record, I have serialized the streams and they take 1967 microseconds, while with Hyper-Q I get 414 microseconds. So Hyper-Q actually works, even without optimizations.

As Oleg very rightly said:

Keep in mind the kernels launched for Hyper-Q need to be of modest to moderate size, otherwise it may end up serializing the streams.

So this is a necessary warning.