Memory from peripheral devices to GPU DMA directly to another device...

Hi everyone,

Until now we have used CUDA to evaluate whether GPU performance is good enough to replace an existing FPGA platform. The results are very satisfying, and now we want to move on from synthetic tests to a real integration into the existing environment. The data to be processed must be pushed from a specific PCI-E device (from now on simply called ‘board’) to a CUDA-enabled GPU. There are mainly two ways to do that:

First one:

         DMA           DMA
board --------> RAM --------> GPU

Second one:


board --------> GPU

The first one should be possible without big trouble. The CUDA application allocates some memory with cudaMallocHost() and the pointer/address is used by a driver which initializes the DMA via programmed I/O. The pointer just needs to be transferred to the driver and mapped to a physical address. But the host RAM would be a bottleneck.
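To make the first path concrete, here is a minimal host-side sketch of it, not a tested integration. It assumes the driver can be handed the pinned pointer; `board_dma_to_host()` is a hypothetical placeholder for the driver call that starts the board's DMA (here it only fills the buffer), and `stage_board_data_to_gpu()` is likewise a name made up for this example. Only the CUDA runtime calls are real API.

```cuda
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in for the board driver. A real driver would translate
// the pinned buffer to physical addresses and program the board's DMA engine;
// here we only fill the buffer so the sketch is self-contained.
static void board_dma_to_host(void* dst, std::size_t bytes) {
    std::memset(dst, 0xAB, bytes);
}

// Stages `bytes` of board data through pinned host RAM into GPU memory.
// Returns 0 on success, -1 if any CUDA call fails.
int stage_board_data_to_gpu(std::size_t bytes) {
    void* h_buf = nullptr;          // page-locked host buffer (DMA target)
    void* d_buf = nullptr;          // destination buffer on the GPU
    cudaStream_t stream = nullptr;

    // Page-locked memory stays resident, so both the board and the GPU
    // can DMA to/from it.
    cudaError_t err = cudaMallocHost(&h_buf, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_buf, bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&stream);

    if (err == cudaSuccess) {
        // Step 1: board -> RAM (placeholder for the driver-initiated DMA).
        board_dma_to_host(h_buf, bytes);
        // Step 2: RAM -> GPU, asynchronous with respect to the host.
        err = cudaMemcpyAsync(d_buf, h_buf, bytes,
                              cudaMemcpyHostToDevice, stream);
    }
    // Wait until the copy has finished before releasing the buffers.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    if (stream) cudaStreamDestroy(stream);
    if (d_buf)  cudaFree(d_buf);
    if (h_buf)  cudaFreeHost(h_buf);
    return err == cudaSuccess ? 0 : -1;
}
```

Calling `stage_board_data_to_gpu(1 << 20)` would stage 1 MiB. In a real pipeline one would probably keep two pinned buffers and alternate them, so that the board fills one while the GPU copy drains the other.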

The second one is more complicated, but avoids the RAM bottleneck. The board has DRAM which must be mapped into the address space of the host. The address of the mapped DRAM would then have to be usable in place of a pointer from cudaMallocHost(), i.e. as the source for cudaMemcpyAsync(). But is this possible? Is there any way to accomplish that? Has anyone tried to do this (successfully or not)?

I would be happy to get some hints or comments! Thanks in advance.

Sorry for bumping the thread, but has nobody tried to transfer data directly into the GPU from a self-developed device? I cannot believe that we are the only ones who are trying to do that…

Well, since NVIDIA itself has yet to enable direct Tesla to Quadro transfers without using host memory, I doubt that anyone else has succeeded. It’s obviously something people want, though…

People have been requesting this DMA directly from a device since the days before CUDA 0.8. NVIDIA has always said that they are thinking about it. It is also in the FAQ:

Sorry to make this post sound like a “just read the FAQ or google it” one, but it’s hard not to when it is on a topic that has been discussed many times.…lient=firefox-a

Maybe, just maybe NVIDIA is finally actually putting their ideas to code and Tim will jump in this thread and drop one of his subtle hints about upcoming features… but I wouldn’t hold your breath. CUDA 2.2 completely revamped the way pinned memory is handled in the driver and it didn’t even bring this feature.

Thanks for your replies.

I used the forum search function and did not find any relevant topics. Google is much better, so sorry for opening a new (probably) worthless thread. But maybe someone who implemented the first possibility (with 2 DMA transfers) can say something about the achieved bandwidth/latency…

Thanks again.

nope, nothing to announce/subtly hint. believe me, I’d like this kind of thing too.

Because you say that your board’s memory exists in the host address space, then your transfer should look like this:



RAM --------> GPU


Right? Or am I missing something?

And if I’m right, the RAM on your board must be dual ported, correct?