GTX 780 released: GK110 for $650

For those (including me) looking to try out dynamic parallelism in CUDA, there’s now an option that costs less than Titan:

http://www.anandtech.com/show/6973/nvidia-geforce-gtx-780-review

12 SMX’s, 3 GB of RAM, with double precision capped at 1/24 of single precision throughput. (For comparison, Titan has 14 SMX’s, 6 GB of RAM, and a driver mode that runs double precision at 1/3 of single precision throughput.)

Awesome.

It also looks like some GTX 780’s will have “Boost” speeds over 1GHz. That’s a big Cores-x-MHz product!

Interestingly, a ‘Superclocked’ 780 and the TITAN have nearly identical SMX × Boost products: 12 × 1020 ≅ 14 × 876.

The forthcoming “ShadowPlay” live screen recording feature also looks useful.

Great!

  • High single precision capacity
  • Higher register/thread count (256!)
  • Dynamic parallelism

It would be very interesting to get some feedback on dynamic parallelism performance on these cards.

I am currently limited by the CUDA API kernel launch overhead (my kernels are small), and launching each kernel directly from the GPU would be a very useful feature for me.

Is there any way to find a test of this particular feature, or could someone propose a simple “microtestbench” to run on this architecture?
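In case it helps get something concrete going, here is a minimal sketch of the kind of microtestbench I have in mind. It is an assumption on my part, not tested on a 780: it times N back-to-back host-side launches of a deliberately tiny kernel against a single parent kernel that launches the same child N times via dynamic parallelism. The kernel names, the block size of 32, and the count of 1000 are all placeholders you would want to vary.

```cuda
// Hypothetical launch-overhead microbenchmark (sketch, untested on GK110).
// Build assumes a CC 3.5+ device and relocatable device code:
//   nvcc -arch=sm_35 -rdc=true launch_bench.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child(float *x)           // deliberately tiny kernel
{
    x[threadIdx.x] += 1.0f;
}

__global__ void parent(float *x, int n)   // device-side launcher
{
    for (int i = 0; i < n; ++i)
        child<<<1, 32>>>(x);              // launched from the GPU
}

// Times one invocation of `run` with CUDA events (includes completion,
// since cudaEventSynchronize waits for all preceding work).
static float time_ms(void (*run)(float *, int), float *x, int n)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    run(x, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

static void host_side(float *x, int n)    // n launches from the CPU
{
    for (int i = 0; i < n; ++i)
        child<<<1, 32>>>(x);
}

static void device_side(float *x, int n)  // 1 host launch, n device launches
{
    parent<<<1, 1>>>(x, n);
}

int main()
{
    const int n = 1000;   // stays under the default pending-launch limit
    float *x;
    cudaMalloc(&x, 32 * sizeof(float));
    cudaMemset(x, 0, 32 * sizeof(float));

    time_ms(host_side, x, n);             // warm-up
    printf("host   launches: %.3f ms\n", time_ms(host_side, x, n));
    printf("device launches: %.3f ms\n", time_ms(device_side, x, n));

    cudaFree(x);
    return 0;
}
```

Note the parent kernel only retires once all its children have completed, so the event timing should cover the full device-side chain, not just the enqueue cost.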

This is a good question. We are also working on a project where we are curious how the overhead of launching a few small kernels (part of a larger processing chain) with a single block from the GPU compares to launching the same kernels from the CPU.

One thing to keep in mind is that dynamic parallelism disables parallel kernel invocation from the CPU, since the GPU cannot know in advance how many kernels will be launched from the device side.

I wonder though if this feature is configurable. The Kepler whitepaper gives the impression that there is dedicated logic on the device to dynamically schedule grids from either the host or the device.