Confused about GTX Titan Z Peer-To-Peer (P2P) capability

Somebody in this topic

https://devtalk.nvidia.com/default/topic/796587/cuda-programming-and-performance/peer-to-peer-transfer-failing-on-geforce-gtx-titan-z/

runs the simpleP2P sample (included in CUDA 6.5) and gets the following result:

GPU0 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)

Can somebody enlighten me as to why MY Titan Z is “NOT capable of Peer-To-Peer (P2P)” when running the same test?

What operating system are you running? As far as I know peer-to-peer is only supported on 64-bit OSes and the application itself must be built as a 64-bit binary.

Windows 7 Pro 64bit

Yes, it is built as x64. At least it’s located in the win64 folder. There is another folder there, win32.

I think (but am not sure at all) that peer-to-peer on Windows may be limited to devices running under TCC driver control, and does not work under the default WDDM driver. I do not have personal experience with peer-to-peer on Windows, so treat that as a working hypothesis for now.

[Later:] At least when it was first introduced, peer-to-peer required the TCC driver on Windows, according to slide 7 of the following slide deck:

http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf

Given how significantly the OS interferes with the GPU memory space when the WDDM driver is used, I consider it unlikely that this has changed, but again, I am not sure.
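For what it’s worth, whether a device is running under TCC can be queried programmatically: the cudaDeviceProp structure exposes a tccDriver field. A minimal sketch (error checking omitted for brevity; on Linux the field is always 0, since there is no TCC/WDDM split there):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // tccDriver is 1 when the device runs under the TCC driver,
        // 0 under the default WDDM graphics driver
        printf("device %d (%s): TCC driver = %d\n",
               dev, prop.name, prop.tccDriver);
    }
    return 0;
}
```

On a GeForce card under Windows this would be expected to report 0, consistent with the “GeForce GPUs do not support TCC mode” statement quoted below.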

[Even later:]
Here is authoritative information from the CUDA 6.5 Best Practices Guide:

http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
9.1.4. Unified Virtual Addressing
Devices of compute capability 2.0 and later support a special addressing mode called Unified Virtual Addressing (UVA) on 64-bit Linux, Mac OS, and Windows XP and on Windows Vista/7 when using TCC driver mode. […] UVA is also a necessary precondition for enabling peer-to-peer (P2P) transfer of data directly across the PCIe bus for supported GPUs in supported configurations, bypassing host memory.

Here is what I’ve got from CUDA_Getting_Started_Windows.pdf:

“NVIDIA GeForce GPUs do not support TCC mode” (page 6).

I mean wtf NVIDIA, WTF???

“I mean wtf NVIDIA, WTF???”

I think you are wrong; it should be:

“I mean wtf windows, WTF???”

I took it from some forum: “The support for TCC is driver-dependent… basically NVIDIA locks you out of those features given you’re buying a consumer card to protect their market.”

Regarding your comment: does it mean that the TCC driver works just fine on some other operating system? (Linux, perhaps?)

The TCC driver is a special non-graphics driver offered only on Windows platforms as an alternative to the standard WDDM graphics driver which incurs a lot of overhead due to deep interference of the OS with GPU operation. As you noticed, the TCC driver is not supported with all GPUs.

On Linux there is only one driver variant. You may want to re-read the section I quoted from the Best Practices Guide. I have only ever used professional GPUs (Tesla, Quadro) and will therefore refrain from making statements regarding peer-to-peer functionality on consumer cards, as I have no hands-on experience.
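A quick way to see what the driver reports on your own system, independent of the samples, is to call cudaDeviceCanAccessPeer directly; this is essentially what simpleP2P checks internally. A minimal sketch (assumes two GPUs with device IDs 0 and 1; error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int canAccess01 = 0, canAccess10 = 0;
    // ask the driver whether device 0 can directly address device 1's
    // memory, and vice versa
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("0 -> 1: %s\n", canAccess01 ? "P2P capable" : "NOT P2P capable");
    printf("1 -> 0: %s\n", canAccess10 ? "P2P capable" : "NOT P2P capable");

    if (canAccess01) {
        cudaSetDevice(0);
        // map device 1's memory into device 0's address space;
        // the second argument is a flags word that must be 0
        cudaDeviceEnablePeerAccess(1, 0);
    }
    return 0;
}
```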

Somebody tell me:

  1. does peer-to-peer work at all on the Titan Z?
  2. what are the conditions? (i.e. do I have to install Linux for that?)

Here is a quote from https://developer.nvidia.com/gpudirect (the NVIDIA web site!):

“GPUDirect peer-to-peer transfers and memory access are supported natively by the CUDA Driver. All you need is CUDA Toolkit v4.0 and R270 drivers (or later) and a system with two or more Fermi- or Kepler-architecture GPUs on the same PCIe bus.”

It seems like the fact is: Windows support for CUDA is really limited.

Who knows whose fault it is. But it is limited. And people need to know it. NVIDIA at least MISINFORMS customers (and this has to be corrected and acknowledged).

(Linux support is still unknown!!!)

Correct me if I am wrong, but I always thought that with the consumer GTX 2-in-1 GPUs the memory transfer rate between the two distinct GPUs was limited by the host PCIe bus, which for PCIe 3.0 x16 is theoretically 16 GB/s.

GPUDirect RDMA is a special feature of the Tesla and high-end Quadro lines.

Just out of curiosity, what is the output of the p2pBandwidthLatencyTest application (in the CUDA samples) for the Titan Z?

For two distinct consumer GTX GPUs in my system this is my output:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 980, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 780 Ti, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
P2P Cliques:
[0]
[1]
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  83.32  10.16
     1   9.95 136.08
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  83.47  10.23
     1  10.01 134.58
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  83.52  10.53
     1  10.41 135.64
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  83.43  10.60
     1  10.43 135.59
P2P=Disabled Latency Matrix (us)
   D\D     0      1
     0   4.69  65.39
     1  58.34   4.66
P2P=Enabled Latency Matrix (us)
   D\D     0      1
     0   4.77  62.53
     1  60.14   4.64

I am not even quite sure how to interpret some of those numbers.

Yeah, you are kind of wrong. If you read above, the Titan Z doesn’t support P2P, on Windows at least. I am still investigating Linux.

“For two distinct consumer GTX GPUs in my system this is my output”

I am curious, are the two GPUs seated in different kinds of slots (i.e. x16, x8, x4)?
That could perhaps explain the bandwidth discrepancy between the devices.

And, keeping njuffa’s point in mind:

“I have only ever used professional GPUs (Tesla, Quadro) and will therefore refrain from making statements regarding peer-to-peer functionality on consumer cards”

And are you running Linux here?
The output of p2pBandwidthLatencyTest seems to support P2P functionality for “(distinct) consumer GTX GPUs”:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 980, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 780 Ti, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
P2P Cliques:
[0]
[1]

I see no point in digging out old information from the CUDA 4.0 time frame. NVIDIA should remove such old information to avoid confusion. One would always want to consult the most up-to-date documentation available and keep in mind that any documentation could have bugs, just like software. Except it is easier to find the bugs in software as there are tools to help with that.

In this case, I already quoted relevant information from the CUDA 6.5 Best Practices Guide:

http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
9.1.4. Unified Virtual Addressing
Devices of compute capability 2.0 and later support a special addressing mode called Unified Virtual Addressing (UVA) on 64-bit Linux, Mac OS, and Windows XP and on Windows Vista/7 when using TCC driver mode. […] UVA is also a necessary precondition for enabling peer-to-peer (P2P) transfer of data directly across the PCIe bus for supported GPUs in supported configurations, bypassing host memory.

To the extent of my knowledge I do not see anything in the above that is ambiguous or could be construed as misinformation. I would interpret the above as follows:

  1. peer-to-peer transfers require UVA support
  2. UVA support requires both
     a. a GPU with compute capability >= 2.0
     b. use of a 64-bit operating system. User’s pick of:
        i. Linux
        ii. Mac OS X
        iii. Windows with TCC driver

What I think may be missing in the above is that not only does the OS have to be 64-bit, the app itself must also be 64-bit since UVA requires a 64-bit address space. I am not 100% sure as I have never created 32-bit apps to run on 64-bit operating systems; I always build native 64-bit apps on them.
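Both preconditions can be checked from within an application: sizeof(void *) reveals whether the binary itself is a 64-bit build, and the cudaDeviceProp structure exposes a unifiedAddressing field along with the compute capability. A minimal sketch (device 0 assumed; error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    // UVA needs a 64-bit address space, so the app must be a 64-bit build
    printf("built as %zu-bit binary\n", sizeof(void *) * 8);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // unifiedAddressing is 1 when the device shares a single unified
    // address space with the host, i.e. the UVA precondition for P2P holds
    printf("device 0 (%s): UVA = %d, compute capability = %d.%d\n",
           prop.name, prop.unifiedAddressing, prop.major, prop.minor);
    return 0;
}
```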

When errors or lack of clarity in documents are encountered I would suggest filing a bug report or enhancement request with NVIDIA using the form linked from the CUDA registered developer website. In my observation documentation issues get fixed in the next release cycle as long as such a report is not filed in close proximity to an upcoming release.

What if, to discover the error, I need to buy the product first?
What if I based my purchase decision on a commercial where these errors (or intentional neglect) appeared?

(Because I would consider https://developer.nvidia.com/gpudirect THE source of information, not some obscured white paper that I read after I bought and installed everything, 10k rig included?)

Thank you for your suggestion, though. I respect NVIDIA because I understand that they make the most superior products available. It’s just that they don’t work as advertised. That’s all.

The p2p test I ran above was in Windows 7 64 bit with two different GPUs in different slots in the same PC.

Does the Titan Z prevent one from running that test? That test should at least give you a hard number for the actual bandwidth between the two GPUs on the same board.

I ran that test. On Windows, and then on Linux.

On Linux it supposedly goes over P2P and is about 15-20% faster. On Windows it transparently (behind the scenes) goes through RAM (GPU1->CPU->RAM->CPU->GPU2).
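Note that the API call is the same in both cases: cudaMemcpyPeer always performs the copy, and simply stages it through host memory when peer access is not enabled, which is why the Windows path still works, just more slowly. A hedged sketch (buffer size and device IDs 0/1 are made up for illustration; error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20; // 1 MiB, arbitrary size for illustration
    void *src = NULL, *dst = NULL;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // copies device 0 -> device 1; uses a direct PCIe transfer when peer
    // access is enabled, otherwise falls back to staging through host RAM
    // (GPU0 -> host -> GPU1), exactly the behavior described above
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```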

By the way both Linux and Windows consider Titan Z as 2 separate cards. It’s not one card.

It is one physical card with two GPUs, each with its separately attached memory. This organization applies to all of NVIDIA’s dual-GPU solutions, whether in consumer space (e.g. GTX 690, GTX Titan Z) or professional space (e.g. Tesla K10, Tesla K80). The drivers merely reflect how the hardware is organized. Dual-GPU solutions come in handy when trying to achieve high-density solutions that might be limited by available PCIe slots or space otherwise.

yes, it acts as 2 separate cards. In one PCIe slot.

“yes, it acts as 2 separate cards. In one PCIe slot.”

Finish the sentence:

“yes, it acts as 2 separate cards. In one PCIe slot. Because it has an ‘onboard’ PCIe switch rather splendidly facilitating the casting of what seems to be 2 separate cards as 1 gloriously joint card.”

If you run lspci -vb, you are sure to spot the PCIe switch between the 2 GPU chips, smiling and waving back at you whilst enjoying some tea and a warm bacon and cheese sandwich.