Mapping video memory in user-space to avoid DMA transfers

Good question :-). To make it short, avoiding DMA transfers produces two main benefits: (1) system performance and (2) programmability. Now I will make it long:

(1) System performance: a DMA transfer requires (more or less) two additional accesses to main memory compared to a direct mapping, and accessing main memory is really slow. Also notice that a DMA transfer is likely to require high instantaneous bandwidth from the PCIe bus and the memory controller. With direct mapping you are implicitly overlapping communication and computation in the CPU. We have a paper published at ICS’08 (http://ics08.hpclab.ceid.upatras.gr) where we use a simulator to show that, with some tricks, the total execution time is reduced. Of course, these are simulation results, so any resemblance to reality is purely coincidental. Now we would like to use actual hardware to test our hypothesis.
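Just to make the access-count argument concrete, here is a toy back-of-the-envelope model (my own accounting, not a figure from the paper) of one plausible reading: a cudaMemcpy from pageable memory touches main memory three times per byte, versus once for a direct mapping.

```python
# Toy accounting of main-memory accesses per byte moved to the GPU.
# This is a back-of-the-envelope model, not a measurement.

def staged_dma_accesses(n_bytes):
    """cudaMemcpy from pageable memory: the CPU reads the source buffer,
    writes it into a pinned staging buffer, and the DMA engine then reads
    the staging buffer -> 3 main-memory accesses per byte."""
    return 3 * n_bytes

def direct_mapping_accesses(n_bytes):
    """Direct mapping: the CPU reads the source buffer once and writes
    straight over PCIe to the device -> 1 main-memory access per byte."""
    return 1 * n_bytes

n = 1 << 20
print(staged_dma_accesses(n) // direct_mapping_accesses(n))  # -> 3
```

Whether the real ratio is 2x or 3x depends on whether the source buffer is already pinned, but either way the staging traffic is pure overhead.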

(2) Programmability: double-buffering is painful. You usually have to modify your code in really ugly ways to implement double-buffering (which I assume is how you overlap communication and computation). In my opinion, it would be nicer to just port a sequential kernel to CUDA without restructuring the algorithm around double-buffering. Double-buffering is also system-dependent: you have to tune the buffer size so that computation and communication times match, and with a different memory controller, a different memory hierarchy, etc. the optimal buffer size changes. The DMA interface offered by CUDA only allows GPU kernels to take parameters by value (in other words, it does not allow by-reference parameter passing). Because of this limitation, if your GPU kernel accesses scattered data you have to do a marshaling step prior to transferring the data (which, by the way, harms performance). Another problem with the lack of by-reference parameter passing is that you cannot play tricks with pointers to speed up your algorithm (for instance, keeping an array of pointers to frequently accessed elements). Again, there is a more detailed explanation in our ICS paper.
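The kind of restructuring I mean looks roughly like this plain-C sketch, where `fake_transfer` and `fake_kernel` are hypothetical stand-ins for `cudaMemcpyAsync` and a kernel launch (in real CUDA you would additionally need pinned host memory and separate streams for the overlap to actually happen):

```c
#include <string.h>

#define CHUNK   256
#define NCHUNKS 4

/* Hypothetical stand-in for cudaMemcpyAsync (here just a blocking copy). */
static void fake_transfer(float *dst, const float *src, int n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Hypothetical stand-in for the GPU kernel: scale each element by 2. */
static void fake_kernel(float *buf, int n) {
    for (int i = 0; i < n; i++) buf[i] *= 2.0f;
}

float out[NCHUNKS * CHUNK];

/* The loop can no longer just walk the input: it has to be rewritten
 * around two buffers, a pipeline-priming step, and a pointer swap. */
void process(const float *in, int nchunks) {
    float ping[CHUNK], pong[CHUNK];
    float *cur = ping, *next = pong;

    fake_transfer(cur, in, CHUNK);            /* prime the pipeline */
    for (int c = 0; c < nchunks; c++) {
        if (c + 1 < nchunks)                  /* transfer chunk c+1...   */
            fake_transfer(next, in + (c + 1) * CHUNK, CHUNK);
        fake_kernel(cur, CHUNK);              /* ...while computing chunk c */
        memcpy(out + c * CHUNK, cur, CHUNK * sizeof(float));
        float *t = cur; cur = next; next = t; /* swap ping/pong buffers */
    }
}
```

Note that CHUNK is exactly the system-dependent knob I complained about: pick it too small and you pay per-transfer overhead, too large and the overlap disappears.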

Anyway, the actual reason I would like this kind of support is to check whether avoiding DMA transfers (cudaMemcpy) through memory mapping is a good idea or not. My current guess is that it will be beneficial, but I would prefer to have experimental data to support that opinion ;-).

Best,

Isaac