Inside Volta: The World’s Most Advanced Data Center GPU

jwitsoe · May 10, 2017, 1:38pm

Originally published at: Inside Volta: The World’s Most Advanced Data Center GPU | NVIDIA Technical Blog

Today at the 2017 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the new NVIDIA Tesla V100, the most advanced accelerator ever built. From recognizing speech to training virtual personal assistants to converse naturally; from detecting lanes on the road to teaching autonomous cars to drive; data scientists are taking on increasingly…

anon86464859 · May 10, 2017, 5:25pm

AyyyMD.

anon99755759 · May 10, 2017, 6:28pm

AMDead

anon97663664 · May 10, 2017, 6:29pm

Don't worry guys the RX580 is still faster!

anon71664741 · May 10, 2017, 6:42pm

Your bank account is dead too.

anon99755759 · May 10, 2017, 6:49pm

Yes yes just gotta overclock it a bit, the gains are ayyymazing!

anon15608204 · May 10, 2017, 8:41pm

815mm^2 on a 12nm process. That is a humongous GPU.

anon75845458 · May 10, 2017, 11:32pm

Reticle limit of TSMC

anon36490751 · May 11, 2017, 6:46am

The independent thread scheduling in a warp looks very interesting. With this feature, is it possible to have threads in different branches participate in warp intrinsics like __ballot or __shfl?

anon80379834 · May 11, 2017, 3:35pm

If the intention is to use 8 GV100s in a DGX-1, why 6 NVLinks? Shouldn't it be 7 NVLinks, or am I missing something?

anon80379834 · May 11, 2017, 3:37pm

At least I can buy an RX580, even CrossFire it to beat a 1080 Ti, but I have to admit I don't have $140K to spend on NVIDIA's new card solution.

anon5164959 · May 11, 2017, 5:25pm

"Each Tensor Core performs 64 floating point FMA mixed-precision
operations per clock (FP16 input multiply with full-precision
product and FP32 accumulate, as Figure 8 shows) and 8 Tensor Cores in an
SM perform a total of 1024 floating point operations per clock."

How many 4 x 4 matrix-matrix multiplications is that per clock?
I think it's 16 or 64 FMAs per matrix multiplication, so either 4 or 1 MMMs per clock, but the article doesn't say.
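A quick count may help settle the 16-vs-64 question. This is only a sketch of the naive arithmetic: the function name `fma_count` is made up for illustration, and the two constants are simply the figures from the quote above.

```python
def fma_count(m: int, n: int, k: int) -> int:
    """FMAs in a naive (m x k) @ (k x n) product: each of the m*n outputs needs k FMAs."""
    return m * n * k

# Figures taken from the quoted passage of the article.
FMA_PER_CLOCK_PER_TENSOR_CORE = 64
TENSOR_CORES_PER_SM = 8

# A 4 x 4 by 4 x 4 product has 16 outputs, each a 4-term dot product.
fmas_per_mmm = fma_count(4, 4, 4)

# Matrix-matrix multiplies per clock per Tensor Core under this naive count.
mmm_per_clock_per_core = FMA_PER_CLOCK_PER_TENSOR_CORE / fmas_per_mmm

# Counting one FMA as two floating point operations (multiply + add).
flops_per_clock_per_sm = FMA_PER_CLOCK_PER_TENSOR_CORE * TENSOR_CORES_PER_SM * 2

print(fmas_per_mmm, mmm_per_clock_per_core, flops_per_clock_per_sm)
```

Under this count a 4 x 4 product takes 64 FMAs, which would mean one MMM per clock per Tensor Core, and the per-SM total matches the 1024 operations per clock in the quote if an FMA is counted as two operations.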

The fp16 format is dismayingly approximate, with an infinity that starts above 65,504, and a minimum value above 1 of only a bit less than 1.001. Resolution between 0 and 1 is less cramped, though (~13-14 bits, I think), and is often all the application requires, especially when dealing with probabilities and data normalized to the [0, 1] or [-1, 1] range.
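These limits can be checked without a GPU. The sketch below assumes IEEE 754 binary16 and uses Python's standard `struct` module, whose `"e"` format code is half precision; the helper name `fp16_from_bits` is made up for illustration.

```python
import struct

def fp16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

largest_finite = fp16_from_bits(0x7BFF)   # largest exponent below inf, mantissa all ones
infinity = fp16_from_bits(0x7C00)         # exponent all ones, mantissa zero
one = fp16_from_bits(0x3C00)
next_after_one = fp16_from_bits(0x3C01)   # 1 + 2**-10
step_below_one = one - fp16_from_bits(0x3BFF)  # spacing just under 1 is 2**-11
smallest_subnormal = fp16_from_bits(0x0001)    # 2**-24

print(largest_finite, next_after_one, step_below_one, smallest_subnormal)
```

The spacing halves with each power of two toward zero, which is why the [0, 1] range is better resolved than values above 1.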

The Tensor Cores look like they might be useful for doing 4D Geometric Algebra (GA) / Clifford Algebra calculations, which would be extremely cool, since GA is the best way to do math representing physics, whether classical mechanics, EM, QM, SR or GR. There are too many advantages to list here, but I'll point out that Geomerics (the British company that brought real-time radiosity lighting to games, bought by ARM) was the work of the world's top GA physicists, particularly Cambridge's Chris Doran.

There are only 2 Clifford algebras that can be represented with real-valued 4 x 4 matrices, Cl(3,1) (signature (+++-)) and Cl(2,2) (signature (++--)). The other 4D signatures require 2 x 2 matrices of quaternions. The 2D + 2 Conformal Geometric Algebra (CGA) has a (+++-) Minkowski signature that can also be used for relativistic EM (though the (+---) "space-time algebra" is more common for that use).

The 2D + 2 CGA represents 2D lines, circles, and points as points in a 4D space. It's like extending first to homogeneous coordinates, as in conventional graphics: the extra dimension allows constructing subspaces (e.g. lines) that don't pass through the origin on the 2D plane. In CGA that extra homogeneous dimension is called "origin" for that reason. In addition, CGA adds another extra dimension called "infinity", which allows representing points, circles and lines (and planes and spheres in 3D + 2 CGA) as unified entities: a point has zero infinity component, a circle has some, and a line is a circle passing through the "point at infinity", a circle with infinite radius. Taking the outer product of any 3 points gives the circle passing through those points. An easy "dualization" converts the circle to a representation as a center point and a radius. There are some other primitives, such as point-pairs (0D spheres), in CGA as well that are very useful. All sorts of geometric operations, such as unions and intersections, are much easier in CGA.

Obviously the 3D + 2 CGA is more useful for 3D graphics, but it is a 5D algebra, which doesn't fit in the Tensor units (Cl(4,1) needs 4 x 4 complex matrices). It would be useful to find out if there are practical ways to make GA calculation on GPUs easier and faster, because that would make physics simulations in general much easier to program. GA gives a single, unified representation to areas that now are a vast collection of ad-hoc hacks that often don't work well together. Chris Doran would be the person to talk to about what would make GPUs better for GA and physics simulation in general.

anon50374324 · May 11, 2017, 5:30pm

Yes. Note that you have to use the new "sync" versions of these builtin functions, which take an additional parameter to specify which threads participate in the operation.

anon5164959 · May 11, 2017, 6:07pm

For scale, 35mm camera sensors are 864mm^2 (but they use way lower-res lithography).

I thought multi-chip modules had gotten to the point where it was possible to divide such monster chips into several pieces, with up to thousands of bus lines going through through-silicon vias and then running just a mm or two across the carrier between chips. I guess not, since if you could restrict the pieces to less than a couple hundred mm^2, then yields could be something like an order of magnitude higher.
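To put a rough number on the yield argument, here is a sketch using the simple Poisson yield model, Y = exp(-A * D0). The model choice and the defect density `D0` are assumptions picked purely for illustration, not measured figures for any real process, and the ratio is very sensitive to that choice.

```python
import math

def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the simple Poisson model Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)

D0 = 0.4  # hypothetical defect density in defects per cm^2, NOT a real process figure

monolithic = poisson_yield(815.0, D0)  # the 815 mm^2 die mentioned in this thread
piece = poisson_yield(200.0, D0)       # one piece of "a couple hundred mm^2"

ratio = piece / monolithic
print(f"monolithic {monolithic:.3f}, piece {piece:.3f}, ratio {ratio:.1f}x")
```

With this particular assumed `D0` the per-die yield gap comes out at roughly an order of magnitude; a lower defect density shrinks it quickly, and the model ignores packaging losses and harvesting of partially defective dies.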

anon76846909 · May 12, 2017, 1:53pm

These are not going to be competitive with the TPUs, which have 65,536 8-bit MACs, which is ideal.

anon87882286 · May 12, 2017, 2:30pm

I assume __syncwarp() is also heavily inserted by the compiler whenever safe? ... But does this give up reconvergence of sub-warps in nested divergence, or does the compiler still enforce this implicitly (when safe)?

anon37210196 · May 12, 2017, 3:09pm

The compiler uses a different set of instructions for convergence optimizations. You should expect the same convergence as Pascal (for code that both architectures can run) at no additional effort on your part.

anon74021727 · May 12, 2017, 3:46pm

Tesla M40 Peak FP64?

Table 1 above:
TFLOP/s: 2.1 (about 2100 GFLOPs)

https://images.nvidia.com/c...
page 11 Table 1:
GFLOPs: 210

Where does the factor 10 come from?

96 FP64 Cores * 1114 MHz * ? FLOP/cyc
107 GHz * 2 (e.g. single FMA) would be about 210 GFLOPs
also cmp. https://en.wikipedia.org/wi...
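Spelling out that arithmetic as a small sketch: the core count and clock are the numbers from this post, and counting one FMA per core per cycle as two FLOPs is an assumption.

```python
FP64_CORES = 96    # FP64 core count used in this post
CLOCK_MHZ = 1114   # clock used in this post
FLOP_PER_FMA = 2   # assumption: one FMA per core per cycle, counted as two FLOPs

# cores * MHz gives millions of FMAs per second; divide by 1000 for GFLOP/s.
peak_gflops = FP64_CORES * CLOCK_MHZ * FLOP_PER_FMA / 1000.0

# How far off the 2.1 TFLOP/s table entry would be under these assumptions.
table_gflops = 2100.0
factor = table_gflops / peak_gflops

print(f"{peak_gflops:.1f} GFLOP/s, table value is {factor:.1f}x higher")
```

Under these assumptions the peak comes out near 214 GFLOP/s, which is consistent with the 210 GFLOPs figure rather than 2.1 TFLOP/s.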

anon87882286 · May 12, 2017, 3:51pm

So it converges (roughly?) at the IPDOM unless synchronization is detected, in which case it converges at the safest reconvergence point the compiler can detect?

If so, then "You should expect the same convergence as Pascal (for code that both architectures can run) at no additional effort on your part." sounds like the compiler can never "falsely" detect synchronization, which does not sound realistic?

anon4987499 · May 13, 2017, 10:25pm

MPS (Multi-Process Service) has a few restrictions. One of the most mysterious is the lack of support for dynamic parallelism. Is it still prohibited on the Volta generation?
