Real-time GPU processing Peer 2 peer data copy, Linux kernel memory, kernels in kernel,


I am using a GPU (Tesla for the time being) for radar data processing. In this scenario some bespoke hardware, a PCIe plugin card captures and accumulates radar data. Once a certain amount of data is accumulated (4MB), it has to be transferred to the GPU for processing (every 8ms). The challenge is both to get the data onto the GPU and to carry out the algorithms in a timely fashion.

Ideally I would like to trigger the data transfer between the data capture card onto the GPU directly and be notified once the transfer has completed. So the CPU only orchestrates but is not involved in data shuffling. I understand that there is no such support in the NVidia driver for peer to peer data transfer so far, but could there be? I could imagine that many applications face a similar problem, and it is simple not efficient nor is it elegant to transfer all data to RAM first.

Failing a peer to peer data transfer, my driver for the data capture card transfers data to reserved, page-locked kernel memory, which I can mmap into user space. Now, how do I efficiently copy the data to the GPU?

  1. host to device copy from the mmaped region. This might be slow.
  2. copy all data from the mmaped region into a cuda-host-malloced area and then host to device copy.

Is there a more elegant way of transferring data that lives in a page locked memory area?
Could I tell the nvidia driver to use my mmapped region as if it was cuda-host-malloced?

Last, what if I would like to do all that from the kernel? Could I talk to the GPU from a kernel module, i.e. is the API available in kernel space? Could I launch kernels from a kernel module?

My aim is to write a as-real-time-as-it-gets application that transfers data from the data capture card to the GPU and launches a few kernels every 8ms. It won’t be much code, but it has to execute in a deterministic fashion. That’s why I would prefer to stay in the kernel, synchronised to the data capture card.

Many thanks for any hints.
Kind regards


are my questions too unusual, strange, or out of range?

Nvidia, can I copy data from a bespoke PCIe card onto the Tesla, e.g. by setting up a DMA transfer, or some prefetch mechanisms? Perhaps not now, but real soon?

This is very vital to our application and I would be very gla dif it was possible.


I would check the inifinband thread from a few days ago. For infiniband something like this is possible, and someone posted that on linux a small modification of the open-source part of the nvidia kernel driver should also allow this.

Unfortunately I am not able to find the thread you mention, neither by looking through the past few weeks, nor by the Search this forum mechanism. Would you have a link to the thread, or its title?

google works better than the forum search :) :…q=&gs_rfai=

This is the topic:…p;#entry1069704

Many thanks for the link; it looks interesting, for people using Infiniband though.

It still does not address what I would like to achieve: Get a data acquisition card copy captured data directly onto the GPU card. I may be completely mistaken and this kind of peer to peer copy is not possible on PCIe, or it depends on the PCIe hub, and who can become bus master. But the bottom line is: GPUs can be used in many wasy, one important one is processing of captured data, which in many cases is accomplished by some PCIe peripheral or FPGA card. What is the point of having 1TFLOPS if you cannot get the data onto the system?

As mentioned earlier, if I cannot have a peer to peer copy, then I have to find an efficient way of transferring captured data onto the GPU with a minium amount of copying. Since I capture the data into memory reserved at boot time, and efficient transfer of data to the GPU only works well with cudaMalloc’ed memory, how can I tell cudaUseMemoryAsPinned()?


Did you read the post about changing the low-level allocator in the driver? There is a post about how the infiniband trick was already possible by changing some part of the driver. I think it should that way be possible to map a region of your capture card into (pinned) memory and use that pinned memory either with cudamemcpyasync or directly in the kernel through means of zero-copy.

Hello again,

Well, I read the post that mentioned changing the low-level allocator in the driver. But I did not see any links to code, or instructions on how to actuall do it.

Do you really mean to mmap a region of my capture card, e.g. by implementing the mmap file ops, i.e. the PCIe device memory window is exposed to user space. If I then issued a cudamemcpyasync from this reagion onto a cuda device memory address, would the CUDA driver really carry out a device to device copy? How would the cudamemcpyasync call know that this is pinned memory? Does cudamemcpyasync carry out tests before copying to find out if the memory is in DMA-able regions and the physical pages are contiguous?

Or did you mean memory allocated by the driver but still RAM and not IO space?

I think I do not quite understand. Does this mean copying from one card onto the GPU without copying data to RAM, or does it mean, that the GPU can work off the capture card memory directly, so really zero copy?

I am more confused than ever.



Are you still in a planning stage, or do you have some initial results that are problematic? How far are you from reaching the 8ms deadline? Does it break consistently or only sometimes, in which case the RT patches might help (actually, I believe that would be a prerequisite.)

I’m not sure if it’s what you’re looking for, but I stumbled upon this OpenGL extension.
Never tried it myself, but if it works you could use cuda-gl interop.


As a matter of fact, I am working on a the implementation of a smaller Proof of Concept system.

Data have to be collected in smaller chunks from the FPGA card at ~ 2KHz. I started off by using a userIO based driver that catches an interrupt and wakes up a process having selected the corresponding device file. Once woken up, the user space part read from the capture card’s memory window, exposed to user space by the userIO driver. But it turned out to be not reliable enough, wakeup at 2kHz made the user space miss a few windows at no load and quite a few at more load.

So I now remain in the kernel and accumulate data into reserved memory. Every 8ms, a data frame is complete and I can wake up a “Tesla” process, I have to transfer the entire frame onto the GPU for processing. You are right, I will have to use some RT measures, set scheduling prioities or so.

For the small system the memory transfers are not so much of an issue, but the full system will have to cope with 2GB/s and here I would wish to be able to have the data transferred moreless “automagically” from the data capture card onto the Tesla at a certain frequency. And a user space program would be woken up only every 8ms tp launch the CUDA kernels on the new data.


Thanks for the hint. I am not quite sure whether I understand the purpose of the extension; must be something around video capture from a video capture device, possibly a device on the system bus and not on the graphics card itself. I could not quite see how a data pipeline between two devices on a bus could be established, in particular if the two devices are quite different, one GPU card with an NVidia driver, and in my case an FPGA card with its own driver.

For me it starts boiling down to the following questions:

  • Does PCIe allow for peer to peer data transfers, and if so, what support is required from participating cards.
  • Can I request a transfer of a data block from one PCIe device to another PCIe device or would this be triggered automagically with a memcpy()?
  • If I know the source PCIe memory window of my capture card, how do I get the destination memory window of a preallocated memory region on the GPU?


Think the other way around. If you have a pointer to (non-swappable) memory that actually points to the memory window of your pcie card, you can have your GPU DMA that memory in one go from the device to the GPU.

I have no idea if you can map that pcie-card memory into the normal memory space, but I can imagine that it will default be non-swappable. Then the only thing you need to do is:

  • modify cudaMallocHost to also allow a pointer given by you as input, and let CUDA think it is standard pinned memory, that is where the driver-mod comes in.

A cudamemCpy is actually a DMA by the GPU from non-swappable memory. If you have allocated normal memory, the runtime make a small pinned memory buffer, transfers some data to that buffer, let the GPU fire off a DMA. Then it copies the next bit, etc, etc.

Hello and many thanks for the mind opener!

I always assumed that DMA transfer from the host to the GPU is controlled by the host, i.e. by the NVidia driver. But Yes, the GPU can DMA and be the bus master, too. If I understand you correctly, this is what’s happening in a cudaMemCpy()

The PCIe-card memory can certainly be ioremap’ed in the kernel into kernel virtual adresses, and by an mmap file operation of a char driver into user space.

So memory addresses have “colours”, i.e. cudaMemCpy() can tell from the address itself whether it is pinned or swappable?

If the driver for my PCIe-card implements the mmap file operation and this function uses remap_pfn_range to map the PCIe-card memory window into user space, would the resulting memory address look like a pinned memory address to cudaMemCpy()? Would cudaMemCpy consequently make use of DMA transfer from the source address (on the PCIe card) onto the GPU? In this case I would not need to modify the cudaMallocHost() function?

If not, cudaMallocHost is a library function and I do not have the source code, so it would be difficult to modify it?




following E.D. Riedijk’s explanations, I tried out three different types of memory transfers to the GPU for a 16MB block:

  1. host malloc’ed memory

  2. cudaMallocHost (pinned) memory

  3. PCIe device memory

For the last case, I have implemented the mmap file operation of my Xilinx PCIe card driver to map the PCI memory window:

static int xilinxpcie_init(void) {


	pcie_mem_hwaddr = pci_resource_start(pcie_dev, 0);

	pcie_mem_length = pci_resource_len(pcie_dev, 0);



static int xilinxpcie_mmap(struct file *filp, struct vm_area_struct *vma) {


	remap_pfn_range(vma, vma->vm_start,

		pcie_mem_hwaddr >> PAGE_SHIFT,

		pcie_mem_length, vma->vm_page_prot);



The user space program mmaps the region and then calls a cudaMemcpy:

int length = 16*1024*1024;

int fd = open("/dev/xilinxpcie", O_RDWR | O_SYNC);

h_data_mmap = mmap(0, length, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 4096);

h_data_malloc = (char *)malloc(length);

cudaMallocHost((void **)&h_data_cudaMalloc, length);

cutilSafeCall(cudaMalloc((void**)&d_data, length));	


if (type == MEMORY_TYPE_MMAP) {

	cutilSafeCall(cudaMemcpy(d_data, (void *)h_data_mmap, length, cudaMemcpyHostToDevice));


else if (type == MEMORY_TYPE_MALLOC) {

	memcpy(h_data_malloc, h_data_mmap, length);

	cutilSafeCall(cudaMemcpy(d_data, (void *)h_data_malloc, length, cudaMemcpyHostToDevice));


else /*if (type == MEMORY_TYPE_CUDAMALLOC)*/ {

	memcpy(h_data_cudaMalloc, h_data_mmap, length);

	cutilSafeCall(cudaMemcpy(d_data, (void *)h_data_cudaMalloc, length, cudaMemcpyHostToDevice));


What I have measured for 16MB of data is:

A. To transfer 16MB from the PCIe xilinx card to RAM using memcpy into

  1. malloc’ed host memory: 1711ms

  2. cudaMallocHost pinned memory: 1701ms

B. Transfer from host to device takes

  1. malloc’ed host to device transfer: 9ms = 1777MB/s

  2. cudaMallocHost host to device transfer : 5ms = 3200MB/s

C. Put together (1. and 2. require a memcpy first):

  1. device to device through host malloc’ed memory: 1720ms

  2. device to device through cudaMallocHost (pinned) memory: 1706ms

  3. PCIe device to device transfer: 880ms

So it appears that the peer to peer transfer is twice as fast as the transfer through the host, even though the transfer speed is still disappointing.

The conclusion is it might be better to DMA data from a PCIe data acquisition card into RAM first (hopefully into cudaMallocHost pinned memory, usuable by the xilinx driver?) and then transfer onto the GPU.

I hope this findings are usefull to some,


You have written in A that it takes 1.7 seconds to transfer - from your own card to memory - about the amount of data you would need to transfer in 8 milliseconds? That is either a typo or one hell of a bottleneck :-D

Anyways, the people into framegrabbers and video extensions - like the one Nico mentioned above - appear to hang around over at Nvnews;

There are Nvidia devs there as well. Give it a try?

Hello Jma, it’s not a typo unfortunately,

I have a 8ms window to transfer all data and carry out the processing. Well, I could imaging triggering an async data transfer into one part of GPU RAM, then start processing the previous data set. In this way, I have 8ms for data transfer and 8ms for processing, with the effect of introducing an additional delay of 8ms, but that’s OK.

I think that the 1.7s for 16MB were simply due to the rather unsophisticated use of memcpy, which I guess is a while loop transferring byte per byte. In the kernel arch/lib there are better implementations in copy_page.S which makes use of the prefetch property of some addresses. But I do not know whether glibc’s memcpy also has access to that.

So my problem still remains: How do I transfer data in a deterministic and efficient way from an data acquisition card onto the GPU, A. peer to peer DMA, or B. DMA from the DAQ card onto pinned memory, and then DMA to the GPU. Case A. does not seem to work, or the cudaMemCpy does not recognise the mmaped PCIe address as DMA-able. Case B. would require that the the GPU recognises the memory allocated by the DAQ card as pinned memory and use DMA to transfer to the GPU.

I wonder whether I’m alone in my -it seems- peculiar demands. But then I ask myself: what would you use a GPU for? For simulation, yes, but also for data processing. And where does the data come from? It’s probably not a program running on a CPU that produces the data; most likely data is produced somewhere outside the PC and is fed into the PC using some peripheral such as a PCIe card. Consequently, everybody who does serious data processing on external data must face a similar problem: route data efficiently from your peripheral onto the GPU.

Kind regards,

The graphics card can process current data while simultaniosly transferring the next, so you don’t have to worry about that right now.

It is that your data aquisition card isn’t moving anywhere that is the problem. How would xilinx go about to transfer one frame to memory efficiently? They must have some way of getting you above 2 frames/second

Well, I don’t know yet. Xilinx provide a rudimentary Linux driver for the card we are using for development, a Virtex-5 LXT ML555. This driver uses memcpy to do data transfer from the PCI memory to a userspace in the read file operation and the reverse in the write file operation. There is also a stub for DMA, but I have the impression some code is missing (it says in comments: code to do DMA goes here).

In a previous trial I mmap’ed the PCI memory address to user space and used memcpy to transfer the data into cudaMallocHost pinned memory. I guess that the glibc implementation of memcpy might do better then copying byte by byte, perhaps 8 byte long words, but it won’t do a DMA.

What I am going to try next is to measure memcpy in the kernel. From linux/arch/x86/lib I can see that the implementation is quite clever, so it could be faster. Then I’ll need an implementation of DMA into reserved memory. Unfortunately for this case, I’ll still need to memcpy from kernel space into user space, e.g. the into cudaMallocHost pinned memory. It all sounds very convoluted and uncecessarily complicated and inefficient.

Another idea, imagine in the userspace I issue a read and provide the pointer obtained from cudaMallocHost. In the read file operation in the kernel, I convert the address from virtual into physical, and then DAM into it. But can I convert a user space virtual memory address into kernel virtual address or even a physical address?

Failing that, it has to be the other way round. The FPGA driver uses reserved memory and through an ioctl this adress can be retrieved into user space. Then a call into cuda with this pointer as an argument will make the memory region appear as pinned memory to cuda. But in this case I would have to change the code to the cuda lib for which I do not have the source code. Stuck.

I am running out of ideas. Data transfer from the FPGA card to the GPU is an essential pillar for our application. If it cannot be done, we have to look into alternatives.



If Xilinx provides an ioctl, my guess is that this will be the best way to get the data from the card into real memory, pinned on the DDR sticks so to say. Just because you can map the Xilinx cards own memory into userspace doesn’t mean that the horrendous red tape around every access disappears. If you do not transfer all in one single chunk, then that explains why performance is so low. Experience from Nvidias cards suggest that the initial overhead equals (approx) 10KB of transferred data. So byte by byte transfers will take forever.