Until now we have used CUDA to evaluate whether the performance of GPUs is good enough to replace an existing FPGA platform. The results are very satisfying, and now we want to move on from synthetic tests to a real integration into the existing environment. The data that needs to be processed must be pushed from a specific PCI-E device (from here on simply called 'board') to a CUDA-enabled GPU. There are mainly two ways to do that:
First one:
          DMA             DMA
  board --------> RAM --------> GPU

Second one:
          DMA
  board --------> GPU
The first one should be possible without much trouble. The CUDA application allocates some page-locked memory with cudaMallocHost(), and the pointer/address is used by a driver that initializes the DMA via programmed I/O. The pointer just needs to be handed to the driver and translated to a physical address. But the host RAM would be a bottleneck.
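For reference, here is roughly how I picture the first path in code. This is only a sketch: the device node /dev/myboard and the ioctl request MYBOARD_IOC_SET_DMA_TARGET are made-up placeholders for our driver interface, and the completion signaling is elided.

```c
/* Sketch of path one: board -> pinned RAM (driver DMA) -> GPU (CUDA copy).
 * /dev/myboard and MYBOARD_IOC_SET_DMA_TARGET are hypothetical names. */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)
#define MYBOARD_IOC_SET_DMA_TARGET 0  /* placeholder request code */

int main(void)
{
    void *host_buf, *dev_buf;
    cudaStream_t stream;

    /* Page-locked host memory: DMA-able by the board (once the driver
     * has translated it to physical pages) and fast for async copies. */
    cudaMallocHost(&host_buf, BUF_SIZE);
    cudaMalloc(&dev_buf, BUF_SIZE);
    cudaStreamCreate(&stream);

    /* Hand the buffer to the board driver, which pins/translates the
     * pages and programs the board's DMA engine via programmed I/O. */
    int fd = open("/dev/myboard", O_RDWR);
    ioctl(fd, MYBOARD_IOC_SET_DMA_TARGET, host_buf);

    /* ... wait until the driver signals completion of board -> RAM ... */

    /* Second hop: RAM -> GPU. */
    cudaMemcpyAsync(dev_buf, host_buf, BUF_SIZE,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    close(fd);
    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```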
The second one is more complicated, but avoids the RAM bottleneck. The board has DRAM which must be mapped into the address space of the host. Then the addresses of the mapped DRAM must be used by cudaMallocHost(), or rather by cudaMemcpyAsync(). But is this possible at all? Is there any way to accomplish it? Has anyone tried to do this (successfully or not)?
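The best I can come up with for the second path is the sketch below. Since cudaMallocHost() allocates new memory rather than adopting an existing address, I assume a registration call like cudaHostRegister() with the cudaHostRegisterIoMemory flag (available in newer toolkits) would be the CUDA-side candidate; whether it accepts a third-party BAR mapping at all is exactly my question. The device node and mapping size are again invented placeholders.

```c
/* Speculative sketch of path two: map the board's DRAM (a PCIe BAR
 * exposed by a custom driver via mmap() on /dev/myboard -- hypothetical)
 * into the host address space and try to let CUDA copy from it directly. */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DRAM_SIZE (1 << 20)

int main(void)
{
    void *dev_buf;
    cudaMalloc(&dev_buf, DRAM_SIZE);

    /* Map the board DRAM into the host address space. */
    int fd = open("/dev/myboard", O_RDWR);
    void *board_mem = mmap(NULL, DRAM_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

    /* cudaMallocHost() cannot adopt an existing mapping, so register it
     * instead; it is unclear whether this works for third-party BARs. */
    cudaHostRegister(board_mem, DRAM_SIZE, cudaHostRegisterIoMemory);

    /* If registration succeeded, this would be the board -> GPU copy
     * without the intermediate hop through host RAM. */
    cudaMemcpyAsync(dev_buf, board_mem, DRAM_SIZE,
                    cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);

    cudaHostUnregister(board_mem);
    munmap(board_mem, DRAM_SIZE);
    close(fd);
    cudaFree(dev_buf);
    return 0;
}
```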
I would be happy to get some hints or comments! Thanks in advance.