Inside Volta: The World’s Most Advanced Data Center GPU

Will the Volta architecture be extended to GPUs for graphics boards in addition to GPUs for data center accelerators?

It seems that each FP32/INT32 instruction scheduled occupies its sub-SM for two cycles, so on the next cycle a different type of instruction has to be issued - pretty similar to LD/ST instructions on all NVIDIA GPUs, as well as to scheduling on SM 1.x GPUs.

So the new architecture allows FP32 instructions to run at full rate and uses the remaining 50% of issue slots to execute all other types of instructions - INT32 for index/loop calculations, load/store, branches, SFU, FP64, and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't require packing pairs of co-issued instructions into the same thread - each cycle can issue an instruction from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%.

Is my understanding correct?

That is correct.
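For illustration, here is a minimal sketch (not an official NVIDIA sample) of a kernel in which the FP32 math and the INT32 index/address arithmetic are independent, so Volta's separate FP32 and INT32 datapaths can both stay busy:

#include <cuda_runtime.h>
#include <cstdio>

// Strided SAXPY: the INT32 index arithmetic that drives the loop is
// independent of the FP32 FMA, so on Volta the scheduler can keep the
// INT32 unit computing addresses while the FP32 unit executes the FMAs.
__global__ void saxpy_strided(int n, float a, const float *x, float *y)
{
    int stride = gridDim.x * blockDim.x;                     // INT32 work
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;      // INT32 work
         i < n; i += stride)
        y[i] = a * x[i] + y[i];                              // FP32 FMA
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy_strided<<<256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);   // expect 4.0

    cudaFree(x);
    cudaFree(y);
    return 0;
}

On Maxwell/Pascal the FMA and the integer address math compete for the same cores; on Volta they execute on separate units, which is what makes this pattern cheap.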

Will the tensor core intrinsics be able to work on arbitrary 4x4 submatrices (of any bigger matrix), or do they have to be linear in memory?
That is, can I just specify the coordinates of the A, B, C and D submatrices within a bigger matrix and have the tensor core work on those directly?
If yes, what stopped NVIDIA from providing a full FP32 4x4 matrix multiplication core?

Not supported in CUDA 9 due to schedule constraints, but should be supportable on Volta MPS in the future.

Is the 2.4x faster ResNet-50 training using Tensor Cores or not?

The tensor core API will initially provide 3 warp-level operations: 1) load a "fragment" of matrix data from memory into registers, 2) perform a warp-cooperative matrix-matrix multiply on the input fragments in the registers of a warp, and 3) store a matrix "fragment" from registers back to memory. The initial API will operate on 16x16 matrix fragments. With these operations, and within their limitations, you can indeed work on arbitrary submatrices of any bigger matrix. My talk "CUDA 9 and Beyond" at GTC discussed this API and it will be published soon. Full details will be available when we release CUDA 9.
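For a sense of how those three operations fit together, here is a minimal sketch using the fragment/load/mma/store names from the CUDA WMMA API (nvcuda::wmma in mma.h); treat it as illustrative rather than as the definitive interface. One warp computes D = A*B + C for a single 16x16 tile, with FP16 inputs and an FP32 accumulator:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b,
                              const float *c, float *d)
{
    // 1) Load matrix "fragments" from memory into per-warp registers.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);   // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    // 2) Warp-cooperative matrix multiply-accumulate: acc = A * B + acc.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // 3) Store the result fragment from registers back to memory.
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

Launched with at least one full warp (for example <<<1, 32>>>), this handles a single tile; larger matrices are processed by tiling them into 16x16 fragments and iterating the multiply-accumulate over the shared dimension.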

Yes, Tensor Cores were used. These are preliminary results (pre-release hardware and software).

Martin, thanks for catching this typo! It is indeed 0.21 TFLOP/s.

Can you point me to some information on your 48V solutions?

Can you clarify what you are asking for?

Hi Mark, sorry, I should have been more clear.

Google is spearheading a new 48V architecture in the data center. They have proposed a 48V rack to the Open Compute Project http://www.datacenterknowle.... The architecture allows for 48V-to-1V conversion for a GPU in a single step, skipping the classic 12V intermediate bus.

It is my understanding that Nvidia has a board with a Volta GPU that will take a 48V input and convert the voltage to approximately 1V. I was looking for any information on your solution that allows for a 48V input to your GPU.

Thank you!

Hi Mark,

I want to check back with you and see if I provided enough information.

Thank you!

John

Though just as with SLI, CrossFire has a lot of issues unless you stick to the games that truly support it. If not, you're either stuck with issues or just using one of your cards.

And a pair of 580s will cost more than a 1080 Ti thanks to the mining craze (Vega will probably suffer the same fate).

Plus, it'll pull 4x the power, put out 10x the heat, and deliver half the performance in games that don't support multi-GPU.

Cool stuff. The P100 was the world's first commercially available processor with that kind of double-precision (double datatype) compute efficiency (the Intel Xeon Phi's price is a joke - the same as the P100's). Most non-brainy-destructive GPU usage will fit on a brand-new GT 730 4GB GDDR5, which costs approx. $50, in a personal computer (mainly accelerated via a pendrive live session) that can be obtained for free with high probability. P.S. Has anybody seen the letter "E" in text documents? P.P.S. There will still be a lot of people arguing that their C#-written-in-C++ implementation on AMD/Intel is faster and cooler than a median GPU CUDA C programming example. P.P.P.S. It could efficiently compute FEM problems in theory.

Does the FFT speedup of around 1.7x also hold up for large FFTs, such as with 2^20 data points?

Can FP64, FP32, INT, and Tensor Cores compute at the same time?

Either FP kind and INT can co-issue. FP and INT can also co-issue with memory instructions (this was true on Pascal). Tensor Core instructions can only co-issue with instructions that have zero operands (e.g. non-ALU instructions such as branches, predicate boolean logic, and others).

And how many 4x4 matrix multiplies does a Tensor Core need to reach its peak performance?