Mapping PCIe video memory in user-space to avoid DMA transfers

Hi All,
I would like to use CUDA and NVIDIA cards to build a research prototype of efficient data transfer mechanisms between CPUs and accelerators (in this case, NVIDIA cards).
My very first step is to map the video card memory into the Linux kernel address space or into the user-level address space.

I have a small Linux module that gets the BAR areas from the pci_dev structure for the NVIDIA card. I’ve noticed that there are three different configured areas, one of them 255MB large. However, my test card is supposed to have 768MB of memory (first guess: there is some kind of register to select which memory range in the card is mapped to the PCI aperture). Anyway, I can map that PCI memory area and read/write whatever I want from/to it.

For instance, I can have an application write a signature to different addresses (using cudaMalloc and cudaMemcpy) and then read that signature back from the PCI memory I’ve mapped into the kernel address space (problem: it only works when cudaMalloc() returns a low address; guess: if the application requests a large piece of memory, the NVIDIA driver returns device memory outside the mappable range). Another really cool experiment is to write a piece of text to the PCI-mapped memory (say, the first four sentences of Don Quixote) and read it back using cudaMemcpy. In this case, the application has to call cudaMalloc before doing the DMA transfer. I’ve had so much fun doing this :-).
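In case anybody wants to reproduce the first step, here is a minimal sketch of how such a module could enumerate the BARs and map one of them. This is not the exact module described above; the choice of BAR1 as the video-memory window and the bare use of pci_iomap() without claiming the region are simplifying assumptions.

/*
 * Sketch: list the BARs of every NVIDIA device (vendor 0x10de) and map
 * BAR1 (assumed here to be the window into video memory) to read a word.
 */
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/io.h>

static int __init bar_probe_init(void)
{
    struct pci_dev *dev = NULL;
    void __iomem *bar;
    int i;

    while ((dev = pci_get_device(0x10de, PCI_ANY_ID, dev)) != NULL) {
        for (i = 0; i < PCI_ROM_RESOURCE; i++) {
            resource_size_t start = pci_resource_start(dev, i);
            resource_size_t len   = pci_resource_len(dev, i);

            if (!len)
                continue;
            pr_info("BAR%d: start=%pa len=%pa flags=%lx\n",
                    i, &start, &len, pci_resource_flags(dev, i));
        }

        /* Map BAR1 and read the first word, just to prove the mapping works. */
        bar = pci_iomap(dev, 1, 0);
        if (bar) {
            pr_info("BAR1[0] = 0x%08x\n", ioread32(bar));
            pci_iounmap(dev, bar);
        }
    }
    return 0;
}

static void __exit bar_probe_exit(void) { }

module_init(bar_probe_init);
module_exit(bar_probe_exit);
MODULE_LICENSE("GPL");

The user-space side of the signature test is just a couple of CUDA runtime calls; the magic value and the getchar() used to keep the allocation alive are of course only illustrative.

/* Write a signature into device memory, then look for it through the BAR. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    unsigned int sig = 0xcafebabe;   /* signature to look for in the mapped BAR */
    unsigned int *dptr = NULL;

    cudaMalloc((void **)&dptr, sizeof(sig));
    cudaMemcpy(dptr, &sig, sizeof(sig), cudaMemcpyHostToDevice);

    /* The device pointer value hints at the offset inside video memory;
     * whether it falls inside the BAR window depends on the allocation. */
    printf("device pointer: %p\n", (void *)dptr);
    getchar();                       /* keep the allocation alive while probing */
    cudaFree(dptr);
    return 0;
}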

A couple of really ugly hacks come to mind that would let me avoid calls to cudaMemcpy, but they would only work for certain benchmarks. Without thinking too much, I’ve also thought of some ways to figure out how to switch which memory range is mapped to the PCI bus (if this is possible at all). I am completely sure I can have a lot of fun with this… maybe by 2012 I might even have something usable. Anyway, I would prefer the NVIDIA guys to give me some help.

Is there any Linux kernel-level API? (OK, I’ve done an objdump of nvidia.ko, so I already know the answer is no.) Would it be possible to have a Linux kernel-level API?
It would be nice to be able to: (1) Allocate video memory and get both the PCI address and the device address. If there is a register to switch which video memory range is mapped to the PCI bus, I would also need some sort of call to switch from one memory range to another. (2) Launch a given kernel on the video card. Right now I do not really need this feature, but as soon as I move to GPU kernel scheduling (I already have an ugly user-level hack to do this) I will need to launch GPU kernels from the Linux kernel level.
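To make the request concrete, here is a purely hypothetical header for such an API. None of these functions exist in the NVIDIA driver; every name, type, and signature below is my own invention, only to illustrate points (1) and (2).

/* Hypothetical kernel-level API -- nothing here exists in nvidia.ko.
 * Names and signatures are invented purely to illustrate the request. */
#ifndef _NV_KAPI_H
#define _NV_KAPI_H

#include <linux/types.h>
#include <linux/pci.h>

struct nv_vidmem {
    u64    device_addr;  /* address as the GPU sees it (what cudaMalloc returns) */
    u64    bus_addr;     /* PCI address the CPU could ioremap */
    size_t size;
};

struct nv_dim3 { u32 x, y, z; };

/* (1) Allocate video memory and report both addresses. */
int  nv_vidmem_alloc(struct pci_dev *dev, size_t size, struct nv_vidmem *out);
void nv_vidmem_free(struct pci_dev *dev, struct nv_vidmem *mem);

/* If only a window of video memory is visible through the BAR,
 * move that window so it covers the given device address. */
int  nv_set_bar_window(struct pci_dev *dev, u64 device_addr);

/* (2) Launch a GPU kernel from kernel space: a compiled module image,
 * an entry point name, launch geometry, and a flat argument blob. */
int  nv_launch_kernel(struct pci_dev *dev, const void *image, const char *entry,
                      struct nv_dim3 grid, struct nv_dim3 block,
                      const void *args, size_t args_size);

#endif /* _NV_KAPI_H */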

Best,
Isaac

A very interesting read :) I have one question, however: in what way is your scheme more efficient than overlapping kernel computation and cudaMemcpy, as is possible now? How is this more efficient than a DMA transfer?

Apart from that I can imagine that it is just plain fun to do these kinds of things, if only a day had some more hours ;)

Good question :-). To make it short, avoiding DMA transfers produces two main benefits: (1) system performance and (2) programmability. Now I will make it long:

(1) System performance: a DMA transfer requires (more or less) twice as many accesses to main memory as a direct mapping, and accessing main memory is really slow. Also notice that a DMA transfer is likely to demand high instantaneous bandwidth from the PCIe bus and the memory controller. If you use a direct mapping you are implicitly overlapping communication with computation on the CPU. We have a paper published at ICS’08 (http://ics08.hpclab.ceid.upatras.gr) where we use a simulator to show that, with a few tricks, the total execution time is reduced. Of course, these are simulation results, so any similarity with reality is purely coincidental. Now we would like to use actual hardware to test our hypothesis.
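To make the traffic argument concrete, here is a rough sketch of the two paths being compared. The names (and the mapped_vidmem pointer, which would come from the kind of mapping discussed above) are purely illustrative; the point is only where the bytes travel.

/* Path A: the producer writes into host memory, then cudaMemcpy DMAs it to
 * the GPU, so every element touches host DRAM at least twice (CPU write,
 * then DMA read). Path B: the producer writes straight through a mapping of
 * video memory (ideally write-combined), so host DRAM is not touched at all. */
#include <cuda_runtime.h>

void path_a_staged(float *host_buf, float *dev_buf, size_t n)
{
    for (size_t i = 0; i < n; i++)          /* CPU writes host DRAM        */
        host_buf[i] = (float)i;
    cudaMemcpy(dev_buf, host_buf, n * sizeof(float),
               cudaMemcpyHostToDevice);     /* DMA reads host DRAM again   */
}

void path_b_direct(volatile float *mapped_vidmem, size_t n)
{
    for (size_t i = 0; i < n; i++)          /* CPU writes go over PCIe     */
        mapped_vidmem[i] = (float)i;        /* directly into video memory  */
}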

(2) Programmability: double-buffering is painful. You usually have to modify your code in really ugly ways to implement double buffering (which I assume is how you overlap communication and computation). In my opinion, it would be nicer to just port a sequential kernel to CUDA without restructuring your algorithm to allow double-buffering. Double-buffering is also system dependent: you have to tune the buffer size to match computation and communication time, and with a different memory controller, a different memory hierarchy, etc., the optimal buffer size changes. Moreover, the DMA interface offered by CUDA only allows GPU kernels to take parameters by value (in other words, it does not allow by-reference parameter passing). Because of this limitation, if your GPU kernel accesses scattered data you have to do a marshaling pass before transferring the data (which harms performance, by the way). Another problem with not supporting by-reference parameter passing is that you cannot play tricks with pointers to speed up your algorithm (for instance, keeping an array of pointers to frequently accessed elements as a cache). Again, there is a more detailed explanation in our ICS paper.
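For reference, this is the kind of double-buffering I mean: a minimal sketch using two streams and pinned host memory. CHUNK and the trivial process kernel are placeholders; CHUNK is exactly the system-dependent knob complained about above.

/* Double-buffered host-to-device pipeline: while one chunk is being copied
 * and processed on stream s[b], the CPU fills the other pinned buffer. */
#include <string.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)                        /* elements per chunk: tune per system */

__global__ void process(float *data, size_t n)  /* placeholder kernel */
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void run_double_buffered(const float *src, size_t n)
{
    float *h[2], *d[2];
    cudaStream_t s[2];

    for (int b = 0; b < 2; b++) {
        cudaMallocHost((void **)&h[b], CHUNK * sizeof(float)); /* pinned host buffer */
        cudaMalloc((void **)&d[b], CHUNK * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    for (size_t off = 0; off < n; off += CHUNK) {
        int b = (int)((off / CHUNK) & 1);       /* ping-pong between buffers */
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;

        cudaStreamSynchronize(s[b]);            /* is buffer b free again?   */
        memcpy(h[b], src + off, len * sizeof(float));
        cudaMemcpyAsync(d[b], h[b], len * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(len + 255) / 256, 256, 0, s[b]>>>(d[b], len);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; b++) {
        cudaFreeHost(h[b]);
        cudaFree(d[b]);
        cudaStreamDestroy(s[b]);
    }
}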

Anyway, the real reason I would like to have this kind of support is to check whether avoiding DMA transfers (cudaMemcpy) through memory mapping is actually a good idea or not. My current guess is that it will be beneficial, but I would prefer to have experimental data to support this opinion ;-).

Best,

Isaac

Hi Isaac,

Nice topic. I want to investigate this further.

Have there been any advances in the meantime? Please give me a brief report on your status.

E.g., do you see any feasible way to access the memory from kernel space, or is it a waste of time?

Hoping to hear from you soon,

Greetings,

Reinhard