Besides the Titan and Tesla lines, which GPUs support Hyper-Q?

Some new work necessitates Hyper-Q for multiple concurrent kernels, and I wanted to know whether the GTX 780 supports it. My understanding is that the GTX Titan and the Tesla line support it, and they are compute capability 3.5, as is the GTX 780.

Also, is the GTX 780 the cheapest non-mobile GPU with compute capability 3.5?
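One quick way to check your own card is to query the compute capability at runtime. A minimal sketch using the CUDA runtime API (Hyper-Q is a hardware feature of compute capability 3.5+ Kepler parts, so the check below is just a capability test, not an official feature query):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
        // Hyper-Q requires CC 3.5 or newer (GK110/GK208 and later).
        bool hyperQ = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);
        printf("  Hyper-Q capable: %s\n", hyperQ ? "yes" : "no");
    }
    return 0;
}
```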

Some GT 640 and GT 630 models are compute capability 3.5 as well. What they have in common is that they are equipped with no more than 2 GB of GDDR5 memory.

Models with 4 GB DDR3 are currently guaranteed to be either Fermi or Compute 3.0 parts…

Take a look at this thread:

https://devtalk.nvidia.com/default/topic/599056/concurrent-kernel-and-events-on-kepler/

The interesting thing is that while the $45 GK208 GT 635 SM_35 card I bought recently (and have driving my displays) runs the simpleHyperQ example just fine on Linux, the kernels do not run concurrently in Windows. I tried setting the same CUDA_DEVICE_MAX_CONNECTIONS environment variable in Windows, but the behaviour was the same.
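For anyone else experimenting with this: CUDA_DEVICE_MAX_CONNECTIONS has to be set before the CUDA context is created, so setting it from inside the process only works if it happens before the first runtime API call. A hedged sketch (setenv is POSIX; on Windows you would use _putenv_s instead):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // 32 is the documented maximum number of device connections;
    // the default is 8. Must be set before any CUDA runtime call.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);
    cudaFree(0);  // force context creation after the variable is set
    // ... create streams and launch kernels as usual ...
    return 0;
}
```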

Nice, I just ordered 4 of these poor GT635 OEM cards into brutal cryptocoin mining slavery.

Excellent! I’ve found that it’s nice to have a GK110 and a GK208 in the same machine.

Also, the GT635 might be a decent proxy for the forthcoming Tegra K1 (although with twice as many SMXs).

So, assuming the operating system is Ubuntu 13.04, will the GTX 780 support Hyper-Q?

If so, how many concurrent kernels?

I went back to an old AnandTech GTX Titan review where I knew I had posted a comment about simpleHyperQ. I ran the simpleHyperQ example in Windows (WDDM) when I had the GTX Titan and found that you are able to run up to 8 streams concurrently. This is also supported by this post: https://devtalk.nvidia.com/default/topic/544138/cuda-programming-and-performance/hyper-q-and-openmp-on-single-gtx-titan-gpu/post/3810998/#3810998

However, in the other thread I also tested the GTX Titan in Linux and was able to run up to 32 streams concurrently:
https://devtalk.nvidia.com/default/topic/599056/cuda-programming-and-performance/concurrent-kernel-and-events-on-kepler/post/3957491/#3957491

Further, the GT 635 GK208 card I tested under Linux is able to do 16 streams concurrently.
Edit: This same card is able to do 16 streams in Windows 7 x64. Perhaps my Windows x64 setup had some driver issues.

If I had to take a guess, I would think that the GTX 780 would support 32 concurrent streams in Linux, and possibly one of: 8, 16, or 32 concurrent streams in Windows. Would be great if someone could confirm those suspicions.
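For anyone who wants to probe the limit on their own card, here is a hedged sketch in the spirit of the simpleHyperQ sample (not the sample itself): launch one long-spinning kernel per stream and time the batch. If the kernels overlap, the total time stays close to one kernel's time; if they serialize, it grows roughly linearly with the stream count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int nStreams = 16;  // vary this to probe the concurrency limit
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < nStreams; ++i)
        spin<<<1, 1, 0, streams[i]>>>(10000000LL);  // one block per stream
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // Fully concurrent: elapsed ~= one kernel's runtime.
    // Fully serialized: elapsed ~= nStreams x one kernel's runtime.
    printf("elapsed: %.1f ms for %d streams\n", ms, nStreams);

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```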

The GK208 under WDDM (Win7/x64) seems to work for me up to 16. No?

And the K20c with the TCC driver on Win7/x64 works all the way up to 32:

One more thought: it makes sense to support a max of only 16 streams on a single SMX, since Kepler supports only 16 resident blocks per SMX. I know the GK208 has two SMXs, but perhaps a 1:1 ratio of streams to total device blocks would be overkill on this chip. The GK208 is already ludicrously well-featured. :)

Perhaps this means that the Tegra K1 with its single SMX will support 16 streams? Now that would be awesome.

I tried the same GT 635 card under Windows 7 x64 (see post below), and I see the 16 concurrent streams. Either my driver setup under Win 8 x64 was flawed or some other issue (a driver bug?) was showing up. Here are my results on Windows 8 x64 with the TCC driver for the Quadro K6000:

Also, I didn’t see CUDA 6.0 out yet on the registered-developer site… how do you like it?

Here is a GT 635 on another Win7/x64 machine with the 332.21 driver. It seems to work fine:

Very strange… I re-checked the same GT 635 card using the same hardware under a Windows 7 x64 setup with the same 332.21 driver and I can now see the 16 concurrent streams on the GT 635 card. Perhaps my previous setup had some driver issues or the Win 8 x64 drivers have a bug. Here’s how it comes up now:

A question related to concurrent kernels launched using Hyper-Q:

It stands to reason that you would want each concurrent kernel in the launch set to operate (read and write) in its own exclusive memory space, but what if a set of concurrent kernels all want to read from the same global memory space?
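As far as I know, concurrent reads from the same global buffer need no synchronization; only the writes have to stay disjoint. A hedged sketch (the kernel name and scaling factors are made up for illustration): several kernels in different streams all read one shared input buffer while each writes its own private output buffer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(const float* __restrict__ in, float* out,
                      float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;  // shared read, private write
}

int main() {
    const int n = 1 << 20, nStreams = 4;
    float* in;
    float* out[nStreams];
    cudaMalloc(&in, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaStream_t s[nStreams];
    for (int k = 0; k < nStreams; ++k) {
        cudaMalloc(&out[k], n * sizeof(float));
        cudaStreamCreate(&s[k]);
        // Every kernel reads `in`; each writes only its own out[k],
        // so no inter-stream synchronization is required.
        scale<<<(n + 255) / 256, 256, 0, s[k]>>>(in, out[k], k + 1.0f, n);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < nStreams; ++k) {
        cudaStreamDestroy(s[k]);
        cudaFree(out[k]);
    }
    cudaFree(in);
    return 0;
}
```

Marking the shared input `const __restrict__` also gives the compiler a chance to route those loads through the read-only data cache on CC 3.5 parts.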