Basically, the IB stack and the GPU share the same pinned host buffer: GPU data is cudaMemcpyAsync'ed into it, and the IB HCA RDMAs out of it. Marketing numbers say +30%, which of course depends on the frequency and volume of the D2H and MPI_Send() communication, but the speedups are plausible because you tend to save a lot of redundant memcpy's…
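Roughly, the host side of that pattern looks like this (a sketch only; it assumes the CUDA runtime plus an MPI implementation, and the buffer/function names are hypothetical):

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

void send_gpu_buffer(const float *d_data, size_t n, int dest, cudaStream_t stream)
{
    float *h_pinned;
    // Pinned (page-locked) host memory: the same physical pages the IB stack
    // can register for RDMA, so no extra staging copy is needed.
    cudaHostAlloc((void **)&h_pinned, n * sizeof(float), cudaHostAllocDefault);

    // Asynchronous D2H copy into the shared pinned buffer...
    cudaMemcpyAsync(h_pinned, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // ...then the IB HCA RDMAs straight out of the same pages.
    MPI_Send(h_pinned, (int)n, MPI_FLOAT, dest, /*tag=*/0, MPI_COMM_WORLD);

    cudaFreeHost(h_pinned);
}
```

The savings come from the copy you *don't* make: without a shared pinned buffer, the data would be staged through an extra intermediate host buffer before the NIC could touch it.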
This kind of thing has been possible for a long time now, ever since GT200 and CUDA 2.2. Ever notice you can extract Nvidia's Linux-specific driver code from their Linux packages? Hook into their low-level physical resource allocator, and voilà. Why this surfaced as a high-end, supercomputing-centric, trademarked "technology", I do not understand. ZeroCopy is the real enabler. Despite what they imply, it is not direct GPU-to-GPU intercommunication. Maybe soon. http://insidehpc.com/2010/05/28/mellanox-o…the-first-step/
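For reference, the zero-copy mechanism that has been there since CUDA 2.2 is just mapped pinned memory (a sketch under that assumption; the names are illustrative):

```cuda
#include <cuda_runtime.h>

void zero_copy_example(size_t n)
{
    // Enable mapping of pinned host allocations into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_buf, *d_alias;
    // Page-locked, device-mapped allocation: the GPU reads/writes these host
    // pages directly over PCIe, with no cudaMemcpy at all.
    cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

    // A kernel launched with d_alias touches the same physical pages that an
    // RDMA-capable NIC could also register, e.g.:
    // my_kernel<<<blocks, threads>>>(d_alias, n);

    cudaFreeHost(h_buf);
}
```

Note this is still host memory in the middle, which is exactly why it's pinned-buffer sharing rather than direct GPU intercommunication.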
Tim, at GTC I'd be happy to both (a) justify it and (b) buy you a beer. Although, if you guys have the complimentary drink tickets again, everyone wins!
I understand that there is much potential for exploitation if Nvidia opens up the PCIe interface capabilities of the GPU, seeing as non-CPU malware execution is new territory.
I also understand that, since the GPU has its own discrete memory subsystem, this is necessary if you wish to decouple it from the CPU. Always routing through system RAM, even when CPU synchronization/memcpy is avoided, and even on newer integrated MMUs with HT/QPI, is a point of contention. Obviously integrated CPUs/GPUs are a good middle ground; this is why I think the Atom ION is a great platform (seeing as Tegra doesn't support CUDA, although you can do a lot with GL ES).
Perhaps the issue is that most people are still fine with higher latency when transferring data in and out of GPUs. Gaming and supercomputing applications deal with massive amounts of data. My interest has shifted to real-world embedded apps, where it's more critical how fast you can transfer one byte, as opposed to one gigabyte.