TITAN V Announced - 15.0 TFLOPs FP32, 5120 Cores, 12 GB HBM2 VRAM, $3000 US Price

15.0 TFLOPs FP32
110 TFLOPs mixed precision from 640 Tensor cores
Native FP16 capability unknown

If you want DP capability, this is actually a really good option, even at $3000.

Yay, finally tensor cores for the high end gamer … for more tension in gaming ;)

BTW hit me for purchasing that V100 earlier. This was pretty unnecessary in hindsight.

Did you buy it as part of DGX-1/DGX Station or just the card?

Just the card (PCI express version)

Interesting. I wasn’t even aware anyone was selling them standalone. How much was it if you don’t mind me asking?

Typical prices in Euros can be found here (these price quotes already include 19% VAT; the VAT may differ slightly by country of purchase)



Wow, that’s definitely not cheap.

and the HBM2 speedup turned out to be rather disappointing for a random access workload.

I can not find double precision info in the specs. That means there are no DP cores, right?

There’s an article on AnandTech that claims it has half rate (so 7.5 TFLOPs) DP.

Ryan Smith is AnandTech’s Editor in Chief so it should be pretty legit.

Thanks JanetYellen. That is absolutely amazing!

“Concurrent copy and kernel execution: Yes with 7 copy engine(s)”

GV100 deviceQuery report.

Is that correct, 7 copy engines?
I use that capability all the time, hard to imagine using more than 2.

Trying to resist the urge to buy one of these, as my only DP capable GPUs are 2 Tesla K20c, and the Titan V is equivalent to at least 4 K20 GPUs in terms of 64-bit compute.
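For anyone who wants to check the "at least 4 K20 GPUs" estimate, here is a rough back-of-the-envelope sketch. It uses the 15.0 TFLOPs FP32 figure from the announcement and the half-rate DP claim from the AnandTech article. The K20c figure of 1.17 TFLOPs FP64 is an assumption taken from the published peak spec, not something stated in this thread, and peak numbers say nothing about real workloads.

```python
# Rough peak-throughput comparison; all values are peak TFLOPs, not measured.
TITAN_V_FP32_TFLOPS = 15.0   # from the announcement
DP_RATE = 0.5                # half-rate FP64, as claimed in the AnandTech article
K20C_FP64_TFLOPS = 1.17      # ASSUMPTION: published peak spec, not from this thread

# Peak FP64 throughput of the Titan V if the half-rate claim holds.
titan_v_fp64 = TITAN_V_FP32_TFLOPS * DP_RATE

# How many K20c cards that corresponds to, on paper.
k20c_equivalents = titan_v_fp64 / K20C_FP64_TFLOPS

print(f"Titan V peak FP64: {titan_v_fp64} TFLOPs")
print(f"K20c equivalents:  {k20c_equivalents:.1f}")
```

On those assumptions the ratio comes out above 6, so "at least 4" looks like a conservative estimate.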

That device query shows it is in WDDM driver mode, but I assume that TCC driver mode is available.

BTW, for those using Windows 10, the Windows 10 driver updates will take a Titan X out of TCC and put it into WDDM. There are some fixes I have found, but this is a consistent problem (probably more MSFT’s fault than NVIDIA’s).

They will have to pry my Windows 7 system from my dying hands as that has been my most stable workstation of all with the least amount of OS overhead, even when compared to my Ubuntu workstations.

That is a weird number of copy engines, I cannot imagine how that buys anything with PCIe, but I wonder whether that could be functionality built into V100 to make best possible use of NVlink?

I don’t have any special knowledge of V100’s hardware, but one of Volta’s big changes is support for full page-faulting memory access to the whole system (i.e., direct mapping into host RAM and into other GPUs). Perhaps there are really only two copy engines as before, but they are abstracted a bit with a shared queue of pending accesses so that you can overlap multiple requests? A page fault may take extra time to resolve, and you don’t want to lock up one engine waiting for that remapping to finish. Sort of like the way CPU hyperthreading inflates the apparent number of CPU cores to allow user code to efficiently use more of the CPU when one path stalls… maybe the “7 engines” are a hyperinflated abstraction that just encourages more simultaneous copies in order to keep the true dual engines busy when one copy stalls on a page fault.

This is just a fun guess. 7 engines just seems weird otherwise.

CUDA 9.1’s deviceQuery still reports 7 copy engines.

7 copy engines doesn’t seem weird to me. There are 6 NVLinks + 1 PCIE link on Volta V100.

The only number of copy engines that would seem less weird to me is 14.

Ah, so it is connected to NVlink, as I suspected. But I assumed that a GPU supports either NVlink or PCIe, never both, meaning there would be re-use. Plus on high-end cards with PCIe, there have always been two copy engines, one to service each direction of the full-duplex PCIe link.

One of the dangers I have now that I have the “moderator” stamp on my user profile is that comments that I make may be interpreted as something they were not meant to be - i.e. an official pronouncement from NVIDIA. In this case, all I’m really saying is that it does not seem weird to me. It’s not intended to be a full explanation or ground truth.

Whether or not it is useful in practice, or how it is expected to be used exactly, or what the performance benefits may be, or how to witness it are all things that I’m not addressing.

GPUs that use NVLink (e.g. Tesla V100 SXM2, Tesla P100 SXM2, Quadro GP100) all use both NVLink and PCIE for all implementations, with the possible exception of Tesla V100 SXM2 on Power P9, in which the PCIE link is mostly out of the picture (taking a back seat on the CPU->GPU interconnect). For a scheduled data transfer (e.g. one initiated by cudaMemcpy, including cudaMemcpyPeerAsync) my understanding is that a copy engine is always involved, except for the possible corner case where the data transfer size is less than 64KB. I know it’s a special case, but I’m not sure how it is handled exactly. (It may be entirely absorbed within a pushbuffer transaction.)

Anytime the host CPU is x86, the CPU-GPU data transfer will flow over PCIE, but these systems can still have NVLink connectivity between GPUs, and AFAIK a scheduled async data transfer there will involve a copy engine. Given that 6 links could conceivably connect to 6 different GPUs or endpoints (speaking generally, not with respect to any current/known implementation), architecturally 7 copy engines does not seem bizarre to me.

Which is why I said the only number that would seem less weird to me is 14.

Again, I am just enjoying some liberty to chit-chat here, without claiming gospel truth. I don’t know everything. Please take my comments here with a grain of salt. This is just the way things look to me.

You could also claim BS on this observation, since P100 didn’t follow this heuristic (it does not have 5 or 10 copy engines). I can do some hand-waving to explain why I think that is, but since it’s not a “ground truth” pronouncement from NVIDIA, I think it’s best I stop here.

Later: After some more thought about this, the “correct” number seems to be 8. We don’t need 2 copy engines for each NVLink, because in a GPU-GPU interconnect scenario, there is a GPU (with copy engines) on the other end of the link. Therefore 6 copy engines per GPU available for the NVLinks means I could have 2 per link, when considering both GPUs on each end of the link.

This would then leave architecturally a need for 2 for the CPU connection (e.g. over PCIE). So 8 is the correct number, and now 7 seems weird to me. I’m sure there is another missing piece of the puzzle.
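To keep the different counting arguments in this thread straight, here is a small sketch that tallies them as plain arithmetic. The function name and its parameters are hypothetical, made up for illustration. None of this reflects how the hardware is actually organized; it only reproduces the guesses above.

```python
# Hypothetical helper: tallies copy engines under the heuristics discussed above.
def copy_engine_guesses(nvlinks=6, host_links=1):
    return {
        # One engine per link: 6 NVLinks + 1 PCIE link.
        "one_per_link": nvlinks + host_links,
        # Two engines per link, one for each direction of every link.
        "two_per_link": 2 * (nvlinks + host_links),
        # One engine per NVLink (the peer GPU supplies the other direction),
        # plus two for the full-duplex host connection.
        "one_per_nvlink_two_for_host": nvlinks + 2 * host_links,
    }

v100 = copy_engine_guesses()            # 6 NVLinks, as stated above
p100 = copy_engine_guesses(nvlinks=4)   # 4 links, implied by the "5 or 10" remark

print(v100)
print(p100)
```

For V100 this gives 7, 14 and 8 respectively, and for P100 the first two heuristics give the 5 and 10 mentioned earlier. Only the first matches what deviceQuery reports for V100.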