We currently have a Xilinx FPGA evaluation board (a PCIe device), a Windows 10 PC (host), and a GeForce GTX 660 card (GPU). What we want to do is transfer a large block of data (~200 MB) from the FPGA to the GPU. We have developed a Windows device driver that allows the FPGA to transfer data into host memory via DMA, and we can then transfer the data from host memory to the GPU. Note that this host memory is allocated in kernel space with the Windows driver function 'MmAllocateContiguousMemory'.
However, memory allocated this way is presumably not CUDA 'pinned' (page-locked) host memory, so we expect the host-to-GPU transfer to be slower than it would be with pinned memory.
Is it possible to allocate CUDA 'pinned' host memory from kernel space (i.e., inside the driver), so that we can accelerate the transfer? Or, alternatively, can we make our driver-allocated memory behave like CUDA pinned host memory?
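One idea we are considering (just a sketch, not verified on our hardware): since the driver's buffer is already physically contiguous and resident, perhaps we can map it into our user-mode process (e.g. via an IOCTL that calls MmMapLockedPagesSpecifyCache in the driver) and then register that user-space mapping with cudaHostRegister, so the CUDA runtime treats it as pinned. The IOCTL name and the way the pointer is obtained below are our assumptions, not working code:

```cpp
// Sketch only: assumes our driver exposes a (hypothetical) IOCTL that maps
// the MmAllocateContiguousMemory buffer into this process's address space.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // 1. Obtain a user-space pointer to the driver's contiguous buffer.
    //    (Hypothetical step; in a real driver this mapping would typically
    //    be created with MmMapLockedPagesSpecifyCache on the buffer's MDL
    //    and returned to user mode through DeviceIoControl.)
    void*  hostPtr = nullptr;               // would be filled in by the IOCTL
    size_t size    = 200u * 1024u * 1024u;  // ~200 MB

    // 2. Register the already-resident memory with CUDA so that copies
    //    from it can use the fast pinned-memory DMA path.
    cudaError_t err = cudaHostRegister(hostPtr, size, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostRegister failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // 3. Copy to the GPU as usual; the runtime should now treat hostPtr
    //    as page-locked memory.
    void* devPtr = nullptr;
    cudaMalloc(&devPtr, size);
    cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);

    cudaHostUnregister(hostPtr);
    cudaFree(devPtr);
    return 0;
}
```

Would this be expected to work, or does cudaHostRegister reject memory that was not originally allocated by the usual user-mode allocators?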