GPUDirect RDMA Single PCI-e writes

Hi, for context, my aim is to transfer small chunks of data (< 8 bytes at a time) from a third-party device to the GPU with minimal latency. As I understand it, GPUDirect RDMA allows a region of GPU memory to be pinned and its physical address to be obtained through the API. First of all, is this understanding correct? Secondly, I would like the third-party device to send PCI-e writes to that GPU physical address regularly, without any CPU interaction after the initial setup (so as to avoid latency). A persistent kernel would constantly poll that region of memory and process the data when it arrives. Is this possible, or does anyone see potential problems with this approach?

Tiny transfers across PCIe result in pretty low throughput, in the single-digit MB/s range. This fairly recent VMware blog entry shows some data that looks very plausible to me (figures 4 and 8):

https://blogs.vmware.com/apps/2018/06/scaling-hpc-and-ml-with-gpudirect-rdma-on-vsphere-6-7-part-2-of-2.html

Thanks for the response. Low throughput, even on the order of MB/s, is not a massive concern for this application, as the GPU will be performing parallel computations on the same small chunk of input data. What is more of a concern is getting the absolute lowest latency possible. Having the CPU initiate a full DMA transfer for each chunk of data takes too long. Instead, I believe that a direct PCI-e write from the third-party device to the GPU every few microseconds is a better fit for this problem, if it is indeed possible.
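To make the polling scheme from the original question concrete, here is a rough sketch of the GPU side only. It is not a GPUDirect RDMA setup: the host CPU stands in for the third-party device, and the mailbox lives in mapped pinned host memory (cudaHostAlloc) rather than in GPU memory pinned by a kernel-mode driver. The Mailbox layout, the sequence-number-plus-payload packing, and the names poll_mailbox and run_demo are all assumptions made up for illustration. It also assumes Linux, where the kernel starts running as soon as it is launched.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(e));                             \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical single-slot mailbox. The writer packs a 32-bit sequence
// number and a 32-bit payload into one aligned 8-byte word, so one write
// delivers both and the reader never sees a half-updated message.
struct Mailbox {
    unsigned long long word;  // [seq:32 | payload:32], written by the "device"
    unsigned int ack;         // last sequence number consumed, written by the GPU
    unsigned long long sum;   // final result, written by the GPU
};

// Persistent kernel: one thread spins on the mailbox until n_msgs arrive.
__global__ void poll_mailbox(volatile Mailbox *mb, unsigned int n_msgs)
{
    unsigned long long sum = 0;
    for (unsigned int expect = 1; expect <= n_msgs; ++expect) {
        unsigned long long w;
        do {
            w = mb->word;                        // volatile load, re-read each pass
        } while ((unsigned int)(w >> 32) != expect);
        sum += (unsigned int)(w & 0xffffffffull); // "process" the payload
        mb->ack = expect;                        // tell the writer the slot is free
        __threadfence_system();
    }
    mb->sum = sum;
    __threadfence_system();
}

// Host stand-in for the third-party device: writes n_msgs messages whose
// payload equals their sequence number and returns the sum the GPU computed.
unsigned long long run_demo(unsigned int n_msgs)
{
    Mailbox *h_mb = nullptr, *d_mb = nullptr;
    CHECK(cudaHostAlloc((void **)&h_mb, sizeof(Mailbox), cudaHostAllocMapped));
    h_mb->word = 0;
    h_mb->ack = 0;
    h_mb->sum = 0;
    CHECK(cudaHostGetDevicePointer((void **)&d_mb, h_mb, 0));

    poll_mailbox<<<1, 1>>>(d_mb, n_msgs);
    CHECK(cudaGetLastError());

    volatile Mailbox *mb = h_mb;
    for (unsigned int seq = 1; seq <= n_msgs; ++seq) {
        mb->word = ((unsigned long long)seq << 32) | seq;  // single 8-byte store
        while (mb->ack != seq) { /* wait until the GPU has consumed it */ }
    }

    CHECK(cudaDeviceSynchronize());
    unsigned long long result = h_mb->sum;
    CHECK(cudaFreeHost(h_mb));
    return result;
}

int main()
{
    unsigned long long sum = run_demo(8);
    printf("sum of payloads = %llu\n", sum);
    return sum == 36 ? 0 : 1;  // payloads 1..8 add up to 36
}
```

The sequence number is there because the kernel otherwise has no way to tell a new message from a repeat of the same value. The ack field keeps this single-slot version from being overwritten before it is read; a real device writing on a fixed schedule would more likely need a small ring of slots instead.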