Direct access to the GPU memory over PCIe?


In order to reduce the overall data transfers, I would like to directly transfer data from one PCIe card to a Tesla K20 card using PCIe memory write requests. From the documentation, I understand that I could use the DMA engine of the Tesla K20 but, for various reasons, I would prefer to have the PCIe card initiate and control the data transfers.

So I have some questions:

  • Is it possible to directly access the GPU memory through the PCIe link?
  • How can I get the address that I could use on the PCIe link to write in the GPU memory?

I had a look at past posts on this forum but I haven’t found any answer to this question.

Many thanks for your help.

This may be of interest:

Consider a situation where a Tesla GPU is active and has a couple of kernels constantly running and receiving data via GPUDirect over the PCIe link.

Let’s assume that two small kernels (512 threads in total) are continuously copying the GPUDirect input data to another device memory buffer, while a third kernel is actively running, waiting for a global flag to be set to indicate that a device buffer is full of new data.

When the global flag has been set, that third kernel launches a ‘child’ kernel via Dynamic Parallelism, which then reads from the newly filled device input buffer and writes (primarily via atomic operations) to another, exclusive device memory space.

So at some points all three distinct kernels may all be doing independent global memory operations, though never writing/reading to/from the same locations, rather waiting until buffers are updated.

All three active kernels have different stream ids. This is all being done without any interaction from the host CPU.
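The flag-guarded buffer handoff described above can be modelled on the host. What follows is only a minimal sketch in plain C++, not CUDA code and not the poster's implementation: the class name `PingPong`, the two-buffer layout, and the summing "work" are all assumptions made for illustration. On the GPU the same idea would need the flag in global memory, a `__threadfence()` between the buffer writes and the flag store, and a volatile or atomic read of the flag in the waiting kernel.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of two input buffers guarded by "full" flags.
class PingPong {
public:
    explicit PingPong(std::size_t n) {
        assert(n > 0);
        for (int i = 0; i < 2; ++i) {
            buf_[i].assign(n, 0);
            full_[i].store(false, std::memory_order_relaxed);
        }
    }

    // Producer side (stands in for the copy kernels): fill the next buffer
    // and set its flag. Returns false if the consumer has not released that
    // buffer yet, i.e. the consumer is lagging behind.
    bool try_publish(int value) {
        const int i = write_idx_;
        if (full_[i].load(std::memory_order_acquire)) return false;
        for (int& x : buf_[i]) x = value;
        // Release: the buffer contents become visible before the flag does.
        full_[i].store(true, std::memory_order_release);
        write_idx_ ^= 1;
        return true;
    }

    // Consumer side (stands in for the waiting kernel plus its child):
    // if the next buffer is flagged full, sum it and clear the flag.
    bool try_consume(long& sum_out) {
        const int i = read_idx_;
        if (!full_[i].load(std::memory_order_acquire)) return false;
        long sum = 0;
        for (int x : buf_[i]) sum += x;
        sum_out = sum;
        full_[i].store(false, std::memory_order_release);
        read_idx_ ^= 1;
        return true;
    }

private:
    std::vector<int> buf_[2];
    std::atomic<bool> full_[2];
    int write_idx_ = 0;
    int read_idx_ = 0;
};
```

In this model, a `false` return from `try_publish` is the point at which the consumer has fallen behind the input stream: both buffers are still marked full when new data arrives.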

My limited experience with this scenario is that there are points where the relatively large child kernel (millions of threads) starts to lag behind when the two active memory kernels are also trying to update their buffers.

So my questions are:

  1. does the child kernel need a different stream ID than the parent?

  2. In such a real-time situation, is there another approach that would make better use of the resources while still achieving the objectives: reading data via GPUDirect, updating the current buffer(s), and then launching a large kernel via Dynamic Parallelism (large in terms of the number of threads, all of which do a large number of atomic writes)?

“This is all being done without any interaction from the host CPU.”

Do you have a specific GPUDirect RDMA implementation in mind? If so, some mention of it might be useful.

The canonical implementation of GPUDirect v3 (RDMA) today is via CUDA-aware MPI (using Mellanox IB or RoCE). In this implementation, GPUDirect transfers do not happen without some activation by a piece of host code, somewhere. That host code initiates a transaction which has a finite extent, via a call such as MPI_Send (using suitable device buffer pointers, possibly for both source and destination).

  1. does the child kernel need a different stream ID than the parent?

The child kernel does not need a separate streamID from the parent kernel in order to run asynchronously to the parent. All kernel launches are asynchronous with respect to the calling environment. Since the calling environment has a unique streamID compared to the other 2 “main” kernels in question, a separate streamID should not be necessary. Given your description, I’m not sure why it would matter anyway.

Regarding your second question, I can’t really answer it. Certainly all 3 kernels must share the memory bandwidth to main memory, and it’s also not clear to me why you talk about large numbers of atomic writes in the child kernel. Is that somehow related to the overall flow described here, or just an unrelated characteristic of the processing going on in that kernel?

Thanks. This is a new project so I am just getting my head around the details. There is no MPI involved (that I have seen so far in the supporting code), but I need a bit more time with the project to get a better idea of the specifics. The Tesla GPU is receiving data directly from an FPGA setup.

The ‘child’ kernel really is the main workhorse of the pipeline. That kernel reads the input buffer from global memory in coalesced fashion (without reading from the same location more than once), but the multiple atomic writes go to somewhat random locations (dependent on the input data, which cannot be known ahead of time).
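The access pattern described here (sequential reads, data-dependent atomic writes) can be illustrated with a small host-side sketch. This is plain C++ rather than CUDA, and the function name `scatter_accumulate` and the modulo mapping from input value to output slot are assumptions for illustration only; the actual mapping in the pipeline is not stated. In a CUDA kernel the `fetch_add` would correspond to `atomicAdd` on global memory.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads the input once, in order, and atomically increments an output slot
// chosen by the data itself. The "value % num_slots" mapping is hypothetical.
std::vector<std::uint32_t> scatter_accumulate(
        const std::vector<std::uint32_t>& input, std::size_t num_slots) {
    assert(num_slots > 0);
    std::vector<std::atomic<std::uint32_t>> slots(num_slots);
    for (auto& s : slots) s.store(0, std::memory_order_relaxed);

    for (std::uint32_t v : input) {
        const std::size_t slot = v % num_slots;  // data-dependent target
        slots[slot].fetch_add(1, std::memory_order_relaxed);
    }

    // Copy the results out into a plain vector.
    std::vector<std::uint32_t> out(num_slots);
    for (std::size_t i = 0; i < num_slots; ++i) {
        out[i] = slots[i].load(std::memory_order_relaxed);
    }
    return out;
}
```

Because the target slot depends on the input, how often two updates hit the same address (and therefore serialize) depends on the distribution of the incoming data.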

The other 2 kernels serve the sole purpose of supplying the workhorse kernel with the input data it needs in the correct format/layout.

When I reduce the workload (and size) of the ‘memory buffer update’ kernels, the workhorse kernel is able to finish its job in time to switch to the next buffer. So it is a matter of trying to optimize all active work in order to keep up with the flow of input data.

From what I have seen so far, the CPU involvement is minimal because data is coming in at a rate which cannot be managed and transferred over the PCI-e host-device bus in time. After the current data set (which is large) is processed, the output data is pushed out to the host.

Hi, I realise that this is an old post but it is very relevant to some work that I’m doing now. My goal is to transfer small chunks (< 8 bytes at a time) of data rapidly (every few microseconds) into a location in global memory from a third party device. I need the latency to be as low as possible and, as such, I want to completely remove any host interaction from the process. This includes its involvement in initiating DMA transfers. Instead, I want my third party device to form and send PCI-e write packets to transfer the data directly into global memory on the GPU at regular intervals. From what I currently understand, this requires the use of the GPUDirect RDMA technology. And, from my understanding, this technology allows a region of GPU memory to be pinned and its location to be obtained through a CUDA API.

First of all, is this understanding correct? Secondly, does this mean that it is possible for a third party device to repeatedly initiate writes to the same location on GPU memory with no additional overheads?

I am aware that repeatedly sending small chunks of data massively decreases my throughput, but that is not a priority for this task.
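To put a rough number on that throughput cost, here is a small C++ sketch that estimates how much of each PCIe memory write packet is actual payload. It is a back-of-the-envelope model under stated assumptions, not a measurement: it assumes Gen1/Gen2 framing (1 byte start, 2 bytes sequence number, 4 bytes LCRC, 1 byte end), a 12- or 16-byte header, no ECRC, and it ignores 8b/10b encoding and ACK/flow-control traffic. The function name `write_efficiency` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMED per-packet framing on a Gen1/Gen2 link:
// 1 B start + 2 B sequence number + 4 B LCRC + 1 B end.
constexpr std::size_t kFramingBytes = 1 + 2 + 4 + 1;

// Fraction of the bytes on the wire that are useful payload for one
// memory write packet. header_bytes is 12 (32-bit address) or 16 (64-bit).
double write_efficiency(std::size_t payload_bytes, std::size_t header_bytes) {
    assert(payload_bytes > 0);
    assert(header_bytes == 12 || header_bytes == 16);
    // The data field of a packet is a whole number of 4-byte words.
    const std::size_t padded = (payload_bytes + 3) / 4 * 4;
    return static_cast<double>(payload_bytes) /
           static_cast<double>(padded + header_bytes + kFramingBytes);
}
```

Under these assumptions an 8-byte write with a 64-bit address header gives `write_efficiency(8, 16)` = 8 / 32 = 0.25, while a 256-byte write gives 256 / 280, a little over 0.91. So small writes spend most of the link on overhead, which matters for throughput but not necessarily for latency.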