Until now we have used CUDA to evaluate whether the performance of GPUs is good enough to replace an existing FPGA platform. The results are very satisfying, and now we want to move on from synthetic tests to a real integration into the existing environment. The data that needs to be processed must be pushed from a specific PCI-E device (from here on simply called 'board') to a CUDA-enabled GPU. There are essentially two ways to do that:
First one:
        DMA            DMA
board --------> RAM --------> GPU
Second one:
        DMA
board --------> GPU
The first one should be possible without much trouble. The CUDA application allocates pinned memory with cudaMallocHost(), and the resulting pointer is handed over to a driver, which translates it to a physical address and initializes the board's DMA engine via programmed I/O. But the host RAM would be a bottleneck.
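To make the first path concrete, here is a minimal sketch of the host-side code, assuming a hypothetical board driver (the /dev/board node and the BOARD_IOC_SET_DMA_TARGET ioctl are placeholders for whatever programmed-I/O interface the real driver exposes); the CUDA calls themselves are standard:

#include <cuda_runtime.h>

#define BUF_SIZE (16 * 1024 * 1024)  /* 16 MiB staging buffer */

int main(void)
{
    void *pinned = NULL;   /* page-locked host buffer; DMA target of the board */
    void *dev    = NULL;   /* final destination in GPU memory */
    cudaStream_t stream;

    /* Pinned allocation: the pages stay resident, so both the board's
       DMA engine and the GPU's copy engine can address them safely. */
    cudaMallocHost(&pinned, BUF_SIZE);
    cudaMalloc(&dev, BUF_SIZE);
    cudaStreamCreate(&stream);

    /* Hypothetical driver step: hand the buffer to the board driver,
       which translates the virtual address to physical pages and
       programs the board's DMA engine via programmed I/O, e.g.
           fd = open("/dev/board", O_RDWR);
           ioctl(fd, BOARD_IOC_SET_DMA_TARGET, pinned);
       then wait until the board signals completion of the first DMA. */

    /* Second DMA hop: asynchronous copy host RAM -> GPU. Because the
       buffer is pinned, this is a true DMA transfer. */
    cudaMemcpyAsync(dev, pinned, BUF_SIZE, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    /* ... launch kernels that consume 'dev' ... */

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
    return 0;
}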
The second one is more complicated, but avoids the RAM bottleneck. The board has on-board DRAM, which would have to be mapped into the address space of the host. The addresses of the mapped DRAM would then have to be usable with cudaMallocHost(), or rather with cudaMemcpyAsync(). But is this possible at all? Is there any way to accomplish that? Has anyone tried to do this (successfully or not)?
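For reference, this is roughly what the second path would have to look like; the mmap() of the board's DRAM through the hypothetical /dev/board node again stands in for your own driver, and whether cudaMemcpyAsync() will accept (and DMA directly from) such a pointer is exactly the open question:

#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define WIN_SIZE (16 * 1024 * 1024)  /* size of the board's DRAM window */

int main(void)
{
    /* Hypothetical: the board driver exposes its on-board DRAM (a PCI BAR)
       to user space via mmap(). Node name and offset are placeholders. */
    int fd = open("/dev/board", O_RDWR);
    void *board_mem = mmap(NULL, WIN_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

    void *dev = NULL;
    cudaMalloc(&dev, WIN_SIZE);

    /* The crux: 'board_mem' was not allocated by cudaMallocHost(), so the
       CUDA runtime does not know it is mapped I/O memory. Whether the copy
       below works at all, and whether it becomes a single direct DMA instead
       of a staged copy through host RAM, is the question of this thread. */
    cudaMemcpyAsync(dev, board_mem, WIN_SIZE, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);

    cudaFree(dev);
    munmap(board_mem, WIN_SIZE);
    close(fd);
    return 0;
}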
I would be happy to get some hints or comments! Thanks in advance.
Sorry for bumping the thread, but has nobody tried to transfer data directly into the GPU from a self-developed device? I cannot believe that we are the only ones trying to do that…
Well, since NVIDIA itself has yet to enable direct Tesla to Quadro transfers without using host memory, I doubt that anyone else has succeeded. It’s obviously something people want, though…
Maybe, just maybe, NVIDIA is finally putting their ideas into code and Tim will jump into this thread and drop one of his subtle hints about upcoming features… but I wouldn't hold my breath. CUDA 2.2 completely revamped the way pinned memory is handled in the driver, and it didn't even bring this feature.
I used the forum search function and did not find any relevant topics (Google is much better), so sorry for opening a new, probably worthless, thread. But maybe someone who has implemented the first possibility (with two DMA transfers) can say something about the achieved bandwidth/latency…