Inside Volta: The World’s Most Advanced Data Center GPU

Originally published at:

Today at the 2017 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the new NVIDIA Tesla V100, the most advanced accelerator ever built. From recognizing speech to training virtual personal assistants to converse naturally; from detecting lanes on the road to teaching autonomous cars to drive; data scientists are taking on increasingly…



Don't worry guys the RX580 is still faster!

Your bank account is dead too.

Yes yes just gotta overclock it a bit, the gains are ayyymazing!

815mm^2 on a 12nm process. That is a humongous GPU.

Reticle limit of TSMC

The independent thread scheduling in a warp looks very interesting. With this feature, is it possible to have threads in different branches participate in warp intrinsics like __ballot or __shfl?

If the intention is to use 8 GV100s in a DGX-1, why 6 NVLinks? Shouldn't it be 7 NVLinks, or am I missing something?

At least I can buy an RX580, even CrossFire two to beat a 1080 Ti, but I have to admit I don't have $140K to spend on NVIDIA's new card solution.

"Each Tensor Core performs 64 floating point FMA mixed-precision
operations per clock (FP16 input multiply with full-precision
product and FP32 accumulate, as Figure 8 shows) and 8 Tensor Cores in an
SM perform a total of 1024 floating point operations per clock."

How many 4 x 4 matrix-matrix multiplications is that per clock?
A 4 x 4 product has 16 output elements at 4 FMAs each, i.e. 64 FMAs per multiplication, which would mean one MMM per Tensor Core per clock, but the article doesn't say.
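To spell out the arithmetic the article leaves implicit, here is a quick sanity check in Python; the per-core FMA count and per-SM core count are taken from the quote above:

```python
# Sanity check of the Tensor Core arithmetic quoted above.
N = 4                        # Tensor Cores operate on 4x4 matrices
fmas_per_mmm = N * N * N     # 16 output elements, 4 FMAs each = 64

fmas_per_core_per_clock = 64 # from the article
flops_per_fma = 2            # one multiply + one add
cores_per_sm = 8

mmms_per_core = fmas_per_core_per_clock // fmas_per_mmm
flops_per_sm = cores_per_sm * fmas_per_core_per_clock * flops_per_fma

# One 4x4 matmul per Tensor Core per clock, 8 per SM, 1024 FLOPs per SM clock.
print(mmms_per_core, cores_per_sm * mmms_per_core, flops_per_sm)
```

So by this count the answer is one 4 x 4 MMM per Tensor Core per clock, eight per SM, consistent with the article's 1024 FLOPs per SM per clock.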

The fp16 format is dismayingly approximate, with infinity starting just above 65,504 and the smallest representable value above 1 only a bit less than 1.001 (1 + 2^-10). Resolution between 0 and 1 is less cramped, though (~13-14 bits, I think), and is often all the application requires, especially when dealing with probabilities and data normalized to the [0, 1] or [-1, 1] range.
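Those fp16 limits are easy to check from Python, whose struct module supports IEEE half precision via the 'e' format (Python 3.6+):

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE binary16 value and back to a Python float."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Largest finite fp16 value: (2 - 2**-10) * 2**15 = 65504
assert (2 - 2**-10) * 2**15 == 65504.0
assert to_fp16(65504.0) == 65504.0

# Smallest representable value above 1 is 1 + 2**-10, about 1.000977
eps = 2**-10
assert to_fp16(1.0 + eps) == 1.0 + eps
assert to_fp16(1.0 + eps / 2) == 1.0   # halfway case rounds back to 1.0
```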

The Tensor Cores look like they might be useful for doing 4D Geometric Algebra (GA) / Clifford algebra calculations, which would be extremely cool, since GA is the best way to do math representing physics, whether classical mechanics, EM, QM, SR, or GR. There are too many advantages to list here, but I'll point out that Geomerics (the British company that brought real-time radiosity lighting to games, later bought by ARM) was the work of the world's top GA physicists, particularly Cambridge's Chris Doran.

There are only two Clifford algebras that can be represented with real-valued 4 x 4 matrices: Cl(3,1) (signature (+,+,+,-)) and Cl(2,2) (signature (+,+,-,-)). The other 4D signatures require 2 x 2 matrices of quaternions. The 2D+2 Conformal Geometric Algebra (CGA) has a (+,+,+,-) Minkowski signature, so it can also be used for relativistic EM (though the (+,-,-,-) "space-time algebra" is more common for that use).

The 2D+2 CGA represents 2D lines, circles, and points as points in a 4D space. It's like extending first to homogeneous coordinates, as in conventional graphics: the extra dimension allows constructing subspaces (e.g. lines) that don't pass through the origin of the 2D plane. In CGA that extra homogeneous dimension is called "origin" for that reason. In addition, CGA adds another extra dimension called "infinity", which allows representing points, circles, and lines (and, in 3D+2 CGA, planes and spheres) as unified entities: a point has zero infinity component, a circle has some, and a line is a circle passing through the "point at infinity", i.e. a circle with infinite radius. Taking the outer product of any 3 points gives the circle passing through those points, and an easy "dualization" converts the circle to a representation as a center point and a radius. There are some other primitives in CGA as well, such as point-pairs (0D spheres), that are very useful. All sorts of geometric operations, such as unions and intersections, are much easier in CGA.
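The conformal point embedding described above is easy to sanity-check numerically. This is a minimal Python sketch, assuming one common null-basis convention (e_inf = e+ + e-, e_o = (e- - e+)/2); the names (dot, up, NINF, NO) are just for illustration:

```python
# Minimal sketch of the 2D+2 conformal point embedding, signature (+,+,+,-).
# Basis order: e1, e2, e+, e-.
METRIC = (1.0, 1.0, 1.0, -1.0)

def dot(a, b):
    """Inner product of two vectors under the (+,+,+,-) metric."""
    return sum(m * x * y for m, x, y in zip(METRIC, a, b))

def vadd(*vs):
    return tuple(sum(c) for c in zip(*vs))

def scale(s, v):
    return tuple(s * x for x in v)

E1   = (1.0, 0.0, 0.0, 0.0)
E2   = (0.0, 1.0, 0.0, 0.0)
NINF = (0.0, 0.0, 1.0, 1.0)    # "infinity":  e_inf = e+ + e-  (null vector)
NO   = (0.0, 0.0, -0.5, 0.5)   # "origin":    e_o = (e- - e+)/2 (null vector)

def up(x, y):
    """Embed the Euclidean point (x, y) as a conformal point."""
    p2 = x * x + y * y
    return vadd(scale(x, E1), scale(y, E2), scale(0.5 * p2, NINF), NO)

# Conformal points are null vectors, and -2 P.Q recovers the squared
# Euclidean distance between the underlying points.
P, Q = up(3.0, 4.0), up(0.0, 0.0)
assert abs(dot(P, P)) < 1e-12 and abs(dot(Q, Q)) < 1e-12
assert abs(-2.0 * dot(P, Q) - 25.0) < 1e-12   # |(3,4) - (0,0)|^2 = 25
```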

Obviously the 3D+2 CGA is more useful for 3D graphics, but it is a 5D algebra, which doesn't fit in the Tensor Cores (Cl(4,1) needs 4x4 complex matrices). It would be useful to find out whether there are practical ways to make GA calculations on GPUs easier and faster, because that would make physics simulations in general much easier to program. GA gives a single, unified representation to areas that are now a vast collection of ad-hoc hacks that often don't work well together. Chris Doran would be the person to talk to about what would make GPUs better for GA and physics simulation in general.

Yes. Note that you have to use the new "sync" versions of these built-in functions (e.g. __ballot_sync() and __shfl_sync()), which take an additional mask parameter to specify which threads participate in the operation.

For scale, full-frame 35mm camera sensors are 864 mm^2 (but they're made with much lower-resolution lithography).

I thought multi-chip modules had gotten to the point where it was possible to divide such monster chips into several pieces, with up to thousands of bus lines going through through-silicon vias and then running just a millimeter or two across the carrier between chips. I guess not, since if you could restrict the pieces to less than a couple hundred mm^2, yields could be something like an order of magnitude higher.
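A back-of-envelope version of that yield argument, using the classic Poisson yield model Y = exp(-D * A); the defect density here is purely an assumed illustrative number, not a real process figure:

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of defect-free dies under the simple Poisson yield model
    Y = exp(-D * A)."""
    return math.exp(-defects_per_cm2 * area_mm2 / 100.0)  # 100 mm^2 per cm^2

D = 0.5  # defects per cm^2 -- purely illustrative, not a real process number
mono = poisson_yield(815, D)       # one monolithic 815 mm^2 die
piece = poisson_yield(815 / 4, D)  # one ~204 mm^2 piece of a 4-chip module

print(f"monolithic yield: {mono:.1%}, per-piece yield: {piece:.1%}, "
      f"ratio: {piece / mono:.0f}x")
```

With this (assumed) defect density, the per-piece yield comes out well over an order of magnitude higher than the monolithic die, which is the intuition behind the comment.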

These are not going to be competitive with the TPUs, which have 65,536 8-bit MACs, which is ideal.

I assume __syncwarp() is heavily added by the compiler as well, whenever safe? But does this give away reconvergence of sub-warps in nested divergence, or does the compiler still enforce it in an implicit way (when safe)?

The compiler uses a different set of instructions for convergence optimizations. You should expect the same convergence as Pascal (for code that both architectures can run) at no additional effort on your part.

Tesla M40 Peak FP64?

Table 1 above:
TFLOP/s: 2.1 (about 2100 GFLOPs)
Page 11, Table 1:
GFLOPs: 210

Where does the factor of 10 come from?

96 FP64 cores * 1114 MHz = ~107 G cycles/s; at 2 FLOP/cycle (i.e. one FMA) that would be about 214 GFLOPs, matching the ~210 figure.
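The factor of 10 can be checked with back-of-envelope arithmetic: GM200 has 3072 FP32 cores with a 1/32-rate FP64 path, hence the 96 FP64 cores. A quick Python check (clock and core counts taken from public M40 specs):

```python
# Back-of-envelope check of Tesla M40 peak FP64 (numbers from public specs).
fp32_cores = 3072              # GM200
fp64_ratio = 1 / 32            # Maxwell's FP64:FP32 throughput ratio
boost_clock_ghz = 1.114
flops_per_unit_per_clock = 2   # one FMA = multiply + add

fp64_units = int(fp32_cores * fp64_ratio)   # 96 FP64 cores
peak_gflops = fp64_units * boost_clock_ghz * flops_per_unit_per_clock

print(fp64_units, round(peak_gflops))       # ~214 GFLOP/s, i.e. ~0.21 TFLOP/s
```

That lands on ~0.21 TFLOP/s, so the 2.1 TFLOP/s entry looks like a misplaced decimal point.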

So it converges (roughly?) at the IPDOM unless synchronization is detected, in which case it converges at the safest reconvergence point the compiler can detect?

If so, then "You should expect the same convergence as Pascal (for code that both architectures can run) at no additional effort on your part" sounds like the compiler can never "falsely" detect synchronization, which does not sound realistic?

MPS (Multi-Process Service) has a few restrictions. One of the most mysterious is the lack of support for dynamic parallelism. Is it still prohibited on the Volta generation?