Nvidia announces Tesla V100 (Volta)

Wow. Any idea when these will be commercially available?

The DOE is getting them first, for ‘Summit’ and ‘Sierra’:

https://www.nextplatform.com/2016/11/20/details-emerge-summit-power-tesla-ai-supercomputer/

By far the best technical details are surprisingly from WCCFtech.

L1 cache is merged back with shared memory (perhaps like Kepler did)

The new “L0” cache is only vaguely described, but it in some way enables simultaneous fp16 and fp32 operations. Since the operand collectors would be difficult to widen beyond the current three words per clock, my (total guessing speculation) is that the L0 is some kind of expanded SASS .reuse buffer(s) letting the ALUs keep a per-warp cache of recently used operands, freeing the operand collector to feed other units in parallel. I also (very wildly) speculate that ALU results could feed back into this L0 cache, allowing accumulation loops to avoid round-tripping results into and right back out of the register file each iteration. That would lower energy use and also reduce pressure on the result writeback bandwidth, which has to be shared as well (how else would fp32 and fp16 results both get returned per clock?).

Looks like it’s still 64K 32-bit registers per multiprocessor.

( Math: 20 * 1024 * 1024 / 80 / 4 = 64K )

I’m going to miss the standard green trim but, as Raffi Jaharian always says, “Gold is best!”

Allanmac, it’s explicitly said in https://devblogs.nvidia.com/parallelforall/inside-volta/

The most important topic, though, is the new SIMT model. Each real core now includes only 16 ALUs. Does that mean a warp with only 16 lanes active can execute at 100% efficiency?

Another interesting question is what they will cut in GV102. Hopefully only the fp64/tensor engines.

Thanks @BZ! I missed the detailed “Inside Volta” link.

It looks like all 3 matrices of the Tensor Core (fp16/fp16/fp16 or fp16/fp16/fp32) will fit into a single register in each warp lane (32 32-bit words).

I’m trying to understand the use of the tensor core over existing options. Does it basically boil down to doing fp16 FMAs at double the speed, hidden within a matrix multiply? And does this mean there will be a vectorized HMUL instruction that can do two multiplies in parallel?

I’ll just leave this here: https://twitter.com/pixelio/status/269604497972670465

With 16 FP32/INT32 cores, the dispatcher will double-pump a SIMD32 warp. The diagram shows 1 warp/clock, so to keep both the INT and FP pipes busy there would have to be dual issue of an INT and an FP instruction from the same warp.

Soooo… how much?

DGX-1 with P100 at $129,000*
DGX-1 with V100 at $149,000*

I suppose if you extrapolate based on the fact that system has 8x GPUs, then each V100 board would be $2500 more than the P100 equivalent. Accounting for today’s pricing, a Quadro GP100 (PCI-E version of the P100) is anywhere from 6700-8900 USD, so tack on the price difference and it wouldn’t surprise me they’d set the MSRP of this at 9999 USD…

DGX Station at $69,000. That’s a water-cooled workstation at the price of a Tesla Model S.
You know, I’d rather drive a Tesla than compute on one.

Christian

I would agree that the biggest news is the updated threading model. Having forward progression and synchronization on diverged thread groups of a warp is not only going to give prettier code but potentially also better performance.

Also related to the SIMT model, the cooperative groups APIs should make our lives a little bit easier: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

Now that most of you had time to digest the news (and possibly reach out to get more info), here are some ideas I had that I’d like to hear thoughts on:

  • Given the SM structure (Fig. 5 of [1]), it seems like we’re (back to?) executing a warp as two half-warps over two issues; to compensate, i) latency is reduced (that of FMA by 50%, no word about the rest), and ii) INT32/FP32 (only?) instructions can be co-issued, which will require ILP, specifically independent instructions of the kind that can be co-issued (hello again, Kepler-style very theoretical max IPC?);

  • L0 will presumably also aid the above by caching instructions or intermediate results?

  • Would half-warp optimizations become feasible? How does the scheduler group threads, and how efficient is a uniform branch over exactly half a warp?

  • __syncwarp() cost?

  • Any other insights?

Looking forward to MPI running on the GV100! ;)

[1] https://devblogs.nvidia.com/parallelforall/inside-volta/

I suspect that the CUDA shuffle intrinsics remain limited to warps, and hence data exchange within arbitrary cooperative thread groups has to be done through shared memory.

It would be immensely helpful to be able to arbitrarily shuffle data among any thread of a thread group.

Outstanding :-)

Question: Does V100 also have the dp4a and dp2a instructions in hardware? For P100, NVIDIA dropped them from the die (ran out of die space?). I wonder if they put them back in for V100.

Yes, those instructions are available on V100. For example:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#volta

IDP4A Integer Dot Product and Accumulate

Note that the throughput of an operation gated by FP16 TensorCore ops may actually be higher than the throughput of an operation gated by IDP4A. AFAIK the IDP4A rate on V100 is 4x the FP32 rate, just like the sm_61 devices that supported this (so 4x15TF, i.e. 60TOps/s peak theoretical, for V100), whereas TensorCore has a peak theoretical multiply-accumulate rate of 120TF for FP16.

Therefore the story around Deep Learning Inferencing in the Pascal generation that focused on e.g. P4 and P40, and used DP4A/INT8 for max throughput, may (will) shift to FP16 for max inferencing throughput on V100.

Confirmed support for dp4a/dp2a on Volta is good news.

Any idea if this runs in a different circuit than the regular IADD/IMAD instructions? I am wondering if the integer throughput for DP4A and IADD/IMAD might be cumulative.

And yes, I am looking into abusing the Tensor units for integer ops as well.