From NIC to GPU.

Sellig_Dlefreh · April 20, 2010, 10:49am

any progress on that front ?
I would like to analyse ethernet trafic on my device. so fare I still need to use hosts resource… anyone has managed to bypass the full CPU stack and send data directly from NIC to GPU ?
Advice welcome

Sellig

Simon_Green · April 20, 2010, 12:06pm

This is currently no way to DMA directly into GPU memory from another device, and we have no plans on supporting this (at least in the short term).

There is some ongoing work to allow 3rd party Linux drivers to write into page-locked system memory allocated by CUDA, which avoids a CPU copy, but this has not been released yet.

However, it is still possible to get very high sustained throughput today (greater than 4GB/sec) by using page-locked memory and asynchronous transfers to overlap the copy and compute.

Sellig_Dlefreh · April 20, 2010, 12:40pm

Thanks Simon for your prompt reply.

jasonhoward · June 24, 2010, 8:14pm

What is the status of this work? I’d love to support this in our video capture board driver.

Thanks!

Jason

PeterW · June 24, 2010, 8:37pm

Well, I have sort of tried that. If I understand correctly it is the GPU that DMAs data from an address into itself, triggered by a cudaMemcpy(HostToDevice). I have exposed a device’s memory window over mmap to userspace and call cudaMemcpy(HostToDevice) ; the performance was not great, so perhaps it did not use DMA after all.

It would be terrific if Nvidia suported this at some point, as it is only natural that people using GPus for data crunching need to get the data onto the GPU, and data usuall comes from some device, isn’t it.

I would be interested to leanr about that, in particular, if I allocate some page-locked system memory in userspace, and use the same memory to DMA data into it from a device, but in kernel space.

Cheers,

peter

SPWorley · June 24, 2010, 10:56pm

Actually there are two examples of DMA into an NVIDIA GPU… one is via a Quadro video capture card, but the other much more relevant example is the recent announcement of an Mellanox Infiniband card that can DMA to a Fermi GPU directly.

The semi-official Nvidia line is that you can always hack the NVidia Linux driver code yourself and customize it for your own device. This is likely what NV itself did to help Mellanox. I can understand how tricky it would be to support it robustly in general for all possible PCIe devices and their very individual quirks, so this may be reasonable.

tmurray · June 25, 2010, 12:23am

That’s not what GPU Direct does at all–it allows the IB card and the GPU to share pinned memory. There is no support for direct peer-to-peer copies between non-NVIDIA devices at the present time.

SPWorley · June 25, 2010, 3:22am

Hoho, that’s interesting tp learn, since the white paper isn’t quite so clear about that. It says “The new GPUDirect technology from NVIDIA and Mellanox enables NVIDIA Tesla and Fermi GPUs to communicate faster by eliminating the need for CPU involvement in the communication loop and the need for the buffer copy.”

But I guess I (and others) misinterpreted that. It’s STILL copying to and from system memory, it’s only eliminating the need to copy from the Inifiniband pinned memory buffer to the GPU pinned memory. So there’s still two bus transactions (using twice the PCIe bandwidth that DMA would) and you still have the data being saved to system RAM.

OK, makes sense. Thanks, Tim!

tmurray · June 25, 2010, 4:56am

Okay, full stop. It’s using exactly the same amount of PCIe bandwidth as a direct transfer would. The difference is now you’re using a little bit of QPI and DRAM bandwidth on the host machine as well. On a machine where those two interconnects are not bottlenecks, the only thing that a direct copy would save you versus a trip through pinned memory is a tiny bit of latency.

(can you tell people ask me about this a lot and always have the same misconceptions about it)

PeterW · June 25, 2010, 7:40am

Perhaps this confirms my misconception, but it is hard to imagine that the PCIe bandwidth is the same. If I have two cases for example 1 MB:

A) menage a troi

Three parties are involved: device, host and GPU

First, transfer 1MB from a device through the PCIe bus to memory on the host shared between device and GPU driver, also involving the transfer from the PCIe swtich (e.g. in an Intel 5520 host controller) over QPI to the CPU and from there to RAM.

Second, transfer this 1MB from shared memory onto the GPU through the PCIe bus, in reverse order involving RAM to CPU and CPU to PCIe switch (through QPI), and PCIe switch to GPU.

Sum total: 2MB have been transferred on the PCIe bus; the PCIe switch sees two 1MB transfers.

B) peer to peer

By some magic, a transfer between the device and the GPU only involves the PCIe switch that establishes x lanes to be dedicated for this transfer. The CPU does not see a single byte comming through, nor does the QPI. Sum total: 1MB is transferred.

In case A, which seems to be implemented in the Mellanox case, how could they achieve to provide the same pinned system memory to both the GPU (hence cuda) and the Infiniband driver. I would like to learn more about that as I would need to implement the same. But I have no idea how to hack into cudaMallocHost() such that it takes a pointer provided by my device’s driver.

Cheers,

peter

tmurray · June 25, 2010, 4:53pm

In case B, you’re transferring 1MB from the device to the chipset, then 1MB from the chipset to the GPU. Same PCIe bandwidth used, but you save on QPI and DRAM bandwidth. The number of transfers doesn’t matter, the same amount of data is crossing the PCIe bus regardless of whether it changes directions at the chipset or at CPU DRAM.

PeterW · June 25, 2010, 6:19pm

Well, yes and no.

What you describe as two separate transfers, I would rather see as a single transfer involving two leafs hanging off the PCIe hub, i.e. peer to peer. And I guess that the hardware is so clever to create transactions over all lanes from my device to the GPU. This counts for me as one transfer, and I would expect it to take t = 1MB/commonBandwidth seconds; commonbandwidth will be determinded by the PCIe switch depending on the availabe lanes to both leaves. Also the chipset does not store data I believe.

In case A, there are two subsequent transfers, the first from one leaf off the PCIe switch (my device) to RAM through the CPU and QPI, taking t1 = 1MB/myDeviceBandwidth seconds, followed by a second transfer from RAM to another leaf off the PCIe switch (the GPU) taking another t2 = 1MB/gpuBandwidth seconds. So the time required is t = t1+t2.

Provided both myDevice and the GPU offer the same PCIe bandwidth, case A would take twice as long and also use more CPU to RAM bandwidth. Of course, there would be data transfer over both branches off the PCIe switch, but in case A they are in sequence, in case B they happen at the same time.

That’s my reasoning and I might be mistaken.

Cheers,

peter

tmurray · June 25, 2010, 6:30pm

You are mistaken about bandwidth, although you’re getting at the issue where direct transfer actually helps. Direct transfer reduces the latency of inter-device communication but has no impact on PCIe bandwidth utilization. In terms of latency, if you have to wait for the entire 1MB copy to host to complete before starting the transfer to the other device, it will take t1+t2. However, that’s really easily fixed–just copy smaller pieces and begin copying when possible. Let’s say you copy eight 128KB chunks instead of one 1MB chunk. Assuming equal bandwidth to all devices, you only have to wait for the first transfer to finish before you can start the “1MB” copy to the second device. As a result, then your time is (9/8)t1. You can get arbitrarily close to t1, although the overhead of starting transfers will start to dominate if you get too small (I think 64KB is doable).

So overall there’s not a big win to direct transfers until you’re completely latency constrained.

PeterW · June 25, 2010, 7:26pm

I take your point. And I agree, if the transfer is done interleaved with small chunks of data, then the result is nearly the same. Yet, there is conceptual difference. Viewing the situation like a “Y” with at the top leafs sitting PCIe devices, at the center a PCIe hub and at the Y’s bottom the CPU/RAM: Case B (peer to peer) only involves the center and the two top branches, case A (peer to RAM, RAM to peer) involves the branch down to the bottom of the Y and back up again. More components involved, from my gut’s feeling just more potential undetermistic behavious and interference with other stuff going on.

At the moment I am doing simple trials with a “small” system processing 80MB/s. our final system will have to process 2GB/s on 2 Teslas and 2 FPGA input cards, on a Tyan 7015 motherboard (two CPUs, two Intel 5020 chisepts,…) and my concern is that there the stress on the components involved, also on the QPI migth become an issue. And that’s why I would feel much more comfortable if I could trigger a peer to peer transfer, and once completed be notified and start processing a data frame (16MB every 8.5ms).

Cheers,
peter

tmurray · June 25, 2010, 8:41pm

I understand that it seems like doing direct transfers will be more reliable and predictable; however, that’s not necessarily true as this kind of direct transfer gets into chipset-specific territory in a big way.

PeterW · June 25, 2010, 9:43pm

OK, than I shall try the “mydevice → RAM followed by RAM → GPU” approach. There will be some implementation challenges though. The way I know to do things would be to have myDevice’s driver DMA blocks of data into its own kernel memory, then expose the driver’s memory window to user space, notify a user program upon availability of a frame, then in userspace, memcpy from the driver’s published memory window to cudaMallocHost pinned memory, and then issue a cudaMemcpy(hostToDevice). So I have an additional memcpy.

It would be preferrable to allocate in myDevice’s driver some contiguous memory, from a reserved region during boot-up, say a big chunk, 4GB. Then, check out what calls cudaMallocHost() makes into the nvidia kernel driver (if so), and intercept a call to this special size (4GB), in which case, a modified nvidia driver could return the same memory window as myDevice’s driver uses. The rest is careful management of this shared region by both myDevice’s driver who DMAs data into it, and a user space program that issues cudaMemcpy(hostToDevice) which now hopefull uses DMA as well. But that’s just an idea as I don’t have any other solution that avoids copying between two processes memory spaces.

Perhaps, tmurray, you have got an idea?
Cheers,
peter

SPWorley · June 26, 2010, 3:12am

Threads like this are the reason that these CUDA forums are the best resource for GPU programming and research. You’re not going to find information like this kind of GPU/DMA discussion from Google or even asking your local “CUDA guy.”

And obviously Tim’s a big part of that reason… obviously in this thread, but really in the whole CUDA forum. Thanks, Tim! There’s always more to learn!

So, back to the DMA questions. Googling lots of PC DMA examples turn up references to network cards and FPGA cards which seem to be using this same pinned memory paradigm as well. Does all PC architecture DMA involve using pinned host memory as a central hub, or are there any hardware PCIe devices that do DMA from device to device without intermediate host memory storage at all? For example, are SLI GPUs sending partial frames to one another via host pinned memory?

PeterW · June 26, 2010, 8:09am

When using pinned host memory as a central hub, a practical questions still remains unsolved. How actualy implement this concept.

A. The device driver allocates pinned memory in the Linux kernel and by some magic the user space program issueing a cudaMallocHost() can retrieve a pointer to this pinned memory as user space virtual memory address, an address that is then recognised by any cudaMemcpy(HostToDevice) as pinned memory such that cuda will make use of DMA to transfer data to the GPU.

B. The user space program issues a cudaMallocHost() first, and somehow can tell the device driver the memory location, converting the user space virtual memory address into a kernel virutal memory address (and perhaps into a physical address) that now allows the device DMA data into.

Which variant is practical, and is there some sample code available? I only believe it when I see the code.

Cheers,
peter

gogogpu · June 28, 2010, 12:29pm

Peter, so far option (B) is most practical. That is, allocate pinned, mapped memory using cuMemHostAlloc() and pass it down into your kernel driver. There, you can retrieve the physical address (using find_vma() on linux) and initialize DMA as neccessary.

I have been using this method for FPGA → GPU pipelining, where the FPGA is actually an advanced 10GbE NIC that does preliminary image processing on incoming video streams. The performance gains have been very good (greatly reduced latency and CPU usage), but it is fairly unstable and crashes the system often. I think the NV driver changes the mapping or something while the FPGA is still writing to it.

The very top ‘Release Highlight’ of Cuda 3.1 is ‘GPUDirect™ gives 3rd party devices direct access to CUDA Memory’. However, there is absolutely no reference to this in the toolkit itself. Anyone care to clarify?

Mu-Chi_Sung · June 28, 2010, 1:27pm

Peter, so far option (B) is most practical. That is, allocate pinned, mapped memory using cuMemHostAlloc() and pass it down into your kernel driver. There, you can retrieve the physical address (using find_vma() on linux) and initialize DMA as neccessary.

I have been using this method for FPGA → GPU pipelining, where the FPGA is actually an advanced 10GbE NIC that does preliminary image processing on incoming video streams. The performance gains have been very good (greatly reduced latency and CPU usage), but it is fairly unstable and crashes the system often. I think the NV driver changes the mapping or something while the FPGA is still writing to it.

The very top ‘Release Highlight’ of Cuda 3.1 is ‘GPUDirect™ gives 3rd party devices direct access to CUDA Memory’. However, there is absolutely no reference to this in the toolkit itself. Anyone care to clarify?

I did try to install CUDA 3.1 and create infiniband memory region (which is the page-locked memory in infiniband’s terminology) on CUDA pinned memory with no success. Not sure if it’s related to hardware since I am using Mellanox InfiniHost IBA instead of ConnectX IBA with Tesla C1060. I will try to use driver api and verify if newer hardware (new Fermi + ConnectX IBA) would work or not when ordered parts are delivered.

In the mean time, tmurray, could you give us a pointer on this? any example code or just a short answer about how to use the new feature in CUDA 3.1 would be great! Thanks a lot!

Topic		Replies	Views
Real-time GPU processing Peer 2 peer data copy, Linux kernel memory, kernels in kernel, CUDA Programming and Performance	35	8088	June 30, 2010
cudaHostRegister crash or poor performance unknow error (30) in kernel for 64bit host operating syst CUDA Programming and Performance	23	5646	May 8, 2012
CUDA 4.0 CUDA Programming and Performance	63	507394	March 28, 2013
Wishlist Place your considered suggestions here CUDA Programming and Performance	201	204314	April 13, 2009
Unified Memory in CUDA 6 Technical Blog	87	1895	August 16, 2019
Establishing GPU processor and memory usage CUDA Programming and Performance	26	18923	November 17, 2008
Benchmarking GPUDirect RDMA on Modern Server Platforms Technical Blog	40	2686	April 11, 2019
Method to Cycle steal DMA write into DDR5 CUDA Programming and Performance	7	900	December 8, 2017
Inconsistent concurrent transfer speed CUDA Programming and Performance	21	1146	April 17, 2023
Maximizing Unified Memory Performance in CUDA Technical Blog	18	1199	May 14, 2019

From NIC to GPU.

Related topics