I am using a GPU (a Tesla for the time being) for radar data processing. In this scenario a bespoke PCIe plug-in card captures and accumulates radar data. Once 4MB of data has accumulated, which happens every 8ms, it has to be transferred to the GPU for processing. The challenge is both to get the data onto the GPU and to run the algorithms in a timely fashion.
Ideally I would like to trigger the data transfer from the data capture card to the GPU directly and be notified once the transfer has completed, so that the CPU only orchestrates and is not involved in shuffling data. I understand that there is no support for such peer-to-peer data transfer in the NVIDIA driver so far, but could there be? I imagine that many applications face a similar problem, and it is simply neither efficient nor elegant to transfer all data to RAM first.
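For scale, the figures above imply a sustained transfer rate of roughly 500 MB/s. This takes 4MB to mean 4 x 10^6 bytes, which is an assumption; with 4 MiB the figure is about 524 MB/s.

```latex
\frac{4 \times 10^{6}\ \text{B}}{8 \times 10^{-3}\ \text{s}} = 5 \times 10^{8}\ \text{B/s} = 500\ \text{MB/s}
```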
Failing a peer-to-peer data transfer, my driver for the data capture card transfers the data to reserved, page-locked kernel memory, which I can mmap into user space. Now, how do I efficiently copy the data to the GPU? I see two options:
- Do a host-to-device copy straight from the mmapped region. This might be slow.
- Copy all data from the mmapped region into a buffer allocated with cudaMallocHost and then do a host-to-device copy from there.
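The two options above can be sketched as follows. This is only a minimal sketch under stated assumptions: the device node `/dev/radar0` is a hypothetical name for whatever the capture driver exposes, 4MB is taken to mean 4 MiB, and error handling is reduced to a macro. It is not a measured or tuned implementation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Assumption: "4MB" in the text means 4 MiB.
static const size_t kBlockBytes = 4u << 20;

int main() {
    // Hypothetical device node exported by the capture-card driver.
    int fd = open("/dev/radar0", O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    // Map the driver's page-locked kernel buffer into user space.
    void *capture = mmap(NULL, kBlockBytes, PROT_READ, MAP_SHARED, fd, 0);
    if (capture == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    void *d_buf = NULL;      // destination buffer on the GPU
    void *h_staging = NULL;  // page-locked staging buffer owned by CUDA
    cudaStream_t stream;
    CHECK(cudaMalloc(&d_buf, kBlockBytes));
    CHECK(cudaMallocHost(&h_staging, kBlockBytes));
    CHECK(cudaStreamCreate(&stream));

    // Option 1: copy straight from the mmapped region. The CUDA runtime
    // does not know this memory is page-locked, so it handles it like
    // ordinary pageable host memory.
    CHECK(cudaMemcpy(d_buf, capture, kBlockBytes, cudaMemcpyHostToDevice));

    // Option 2: an extra CPU copy into the CUDA-allocated pinned buffer,
    // followed by an asynchronous host-to-device copy on a stream.
    memcpy(h_staging, capture, kBlockBytes);
    CHECK(cudaMemcpyAsync(d_buf, h_staging, kBlockBytes,
                          cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(h_staging));
    CHECK(cudaFree(d_buf));
    munmap(capture, kBlockBytes);
    close(fd);
    return EXIT_SUCCESS;
}
```

Something like `nvcc transfer_sketch.cu -o transfer_sketch` should build it (the file name is arbitrary). In a real loop the allocations would be done once up front and only the copies repeated every 8ms.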
Is there a more elegant way of transferring data that already lives in a page-locked memory area?
Could I tell the NVIDIA driver to use my mmapped region as if it had been allocated with cudaMallocHost?
Lastly, what if I wanted to do all of that from kernel space? Could I talk to the GPU from a Linux kernel module, i.e. is the API available in kernel space? Could I launch CUDA kernels from a kernel module?
My aim is to write an as-real-time-as-it-gets application that transfers data from the data capture card to the GPU and launches a few kernels every 8ms. It won’t be much code, but it has to execute in a deterministic fashion. That’s why I would prefer to stay in kernel space, synchronised to the data capture card.
Many thanks for any hints.