CUDA 9 Features Revealed: Volta, Cooperative Groups and More

Originally published at:

Figure 1: CUDA 9 provides a preview API for programming Tesla V100 Tensor Cores, providing a huge boost to mixed-precision matrix arithmetic for deep learning.

At the 2017 GPU Technology Conference, NVIDIA announced CUDA 9, the latest version of CUDA's powerful parallel computing platform and programming model. CUDA 9 is now available as a free…
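For readers curious what the Tensor Core preview API mentioned above looks like, here is a minimal sketch using the CUDA 9 WMMA (warp matrix multiply-accumulate) API from `mma.h`. The kernel name and the fixed 16x16x16 tile setup are illustrative only; it assumes a Volta GPU and compilation with `nvcc -arch=sm_70`:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes a 16x16x16 mixed-precision
// matrix multiply-accumulate: C = A * B + C, with half-precision
// inputs and a single-precision accumulator.
__global__ void wmma_example(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core op
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_col_major);
}
```

Larger matrices are tiled into these per-warp fragments; the preview API leaves that tiling to the programmer.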

Will the GTX 1050 Ti (laptop) be supported?

What exactly is new in terms of assembly programming?
New libraries and software aren't exactly what you would call "new".

Also, could you drop that idea of "threads" you have promoted for so long?
Initially it was used for marketing purposes, to claim that the GPU is capable of running 32x more threads than it actually can. I see no point in making things worse and worse just for marketing lies. It would be easier for developers to adjust their code to the actual operations performed by the GPU, that is, array operations.

CUDA programs execute thousands of parallel threads. The threads are not as heavyweight as CPU threads, and they are created and run in parallel, but they are still threads, free to branch and take different execution paths. They are not limited to just array operations. Volta and CUDA 9 make this even more flexible.
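To make the distinction concrete, here is a minimal sketch (kernel and variable names are my own) in which threads of the same warp take different branches. The hardware handles the divergence, and on Volta the two paths can even make independent forward progress within a warp:

```cuda
// Each thread chooses its own execution path based on its index.
// Even and odd threads of the same warp run different code;
// the programmer does not have to express this as an array operation.
__global__ void divergent(int *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        out[i] = i * i;   // even-numbered threads take this path
    } else {
        out[i] = -i;      // odd-numbered threads take this one
    }
}
```

On pre-Volta GPUs the two paths of a warp are serialized, which costs performance but does not change the result; the programming model is still one of independent threads.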


I didn't know there were SIMT deniers.

How come? If one divided a CPU's cache into 64-byte private areas and ran one thread, one could claim there are 512 "threads" per L1 cache. Such "parallel" threads would still be one thread in reality. If they can't branch independently, and branches are emulated by having the rest of the warp wait until the branch completes, they are not threads in the actual meaning of the word.

I've always thought the following blog post was a thoughtful, unbiased, well-explained analysis by someone who gets it.

Is there a possibility that NVIDIA will publish the following data:
- instruction format (encoding) and timing for the new architecture,
- whether a destination register can be used as a temporary up until the moment the result of the operation is written to it,
- instruction tagging for pseudo-threads and its effect, theoretical if any,
- a description of register file write and read operations, as well as addressing and throughput for register groups,
- tensor-operation extensions to the instruction set, if any.

@Mark_Harris:disqus I enjoyed reading that blog post, thanks for sharing.

Thanks for the share Mark, the SIMT vs SIMD article is excellent.

Looks like the recorded sessions won't be available until June 8th for people who did not attend GTC 2017.

Hello Mark!

when I read this blog, 2 things caught my eye:

1) in your screenshots you use a Titan card, but "as usual" you talk extensively about Tesla (okay, only Tesla exists now with the Volta architecture, but in many other NVIDIA papers you "unintentionally" talk about Tesla or Quadro, and very little about Titan)

since this blog is about CUDA 9, can you make a complete list of all GPUs compatible with CUDA 9?

2) "These new meta packages provide simple and clean installation of CUDA
libraries for deep learning and scientific computing"

my question:

I'm trying to clearly understand NVIDIA's positioning on the Titan line:

[+] Titans are simply the best choice for "any" CUDA developer: they are (since their introduction) the fastest FP32 CUDA hardware available [the Titan Xp is even faster than the GP100], they are versatile (they work in any machine, are actively cooled, and support both WDDM and TCC mode), they offer close to the largest VRAM, they have a standard-positioned power plug, and they provide a roughly 6x better price-per-CUDA-core ratio than any "professional" card

[+] I think they were introduced as exactly that: a perfect CUDA card for high-end workstations + compute

now, since the main differentiation between all the CUDA-oriented cards is mostly in the driver (allowing or not TCC mode, optimization for OpenGL/CAD, allowing or not virtualization, allowing or not 10-bit output, allowing or not remote/GRID use, server-only restrictions, ...)

[-] why do Titan cards use consumer/gaming drivers? With the famous "unintentional virtualization bug" and the resulting Windows code 43, plus incompatibility with Quadro/Tesla drivers.

[-] I think it's time for NVIDIA to create a specific "Titan-for-compute" driver, focusing on TCC mode + virtualization support - the perfect companion for any CUDA developer

Also, such a driver would then indeed "provide simple and clean installation of CUDA libraries for deep learning and scientific computing", without requiring the installation of a huge driver package.


Here's a complete list of CUDA-compatible GPUs:

thanks for this list, however I guess that:
- Volta requires CUDA 9 to work
- older chips (I'd say from Kepler to Pascal) will all work? I guess some features will not be enabled? so a simple driver update will bring CUDA 9 to all cards?

for question 2:
I'm "dreaming" of the day NVIDIA proposes a clean, efficient, and global driver for compute and deep learning, with no more "artificial" separation (from a driver perspective) between Titan, Quadro, and Tesla. It would be up to NVIDIA to clearly differentiate each product (e.g. Titan for fast and affordable CUDA development without compromise - specifically in the new virtualization area - Tesla for datacenter/supercomputer and scientific/FP64 compute, Quadro for engineering and design). Also, we should be able to freely mix any of those cards in the same system with a unified driver.
basically: buy and use the most efficient card for each usage, without any driver limitation as today...


Will we one day teach AI how to code and let the software developers die out?

Have you planned to make the matrices row-major? Or any direct conversion?

Can you clarify your question?

Existing CUDA codes should run on Volta. To build new codes for Volta, you need CUDA 9. CUDA 9 will support older GPUs as well. Your "question 2" is not a question.