Mapping video memory in user-space to avoid DMA transfers

Good question :-). To make it short, avoiding DMA transfers produces two main benefits: (1) system performance and (2) programmability. Now I will make it long:

(1) System performance: a DMA transfer requires (more or less) two additional accesses to main memory compared to a direct mapping, and accessing main memory is really slow. Also notice that a DMA transfer is likely to require high instantaneous bandwidth from the PCIe bus and the memory controller. With direct mapping you are implicitly overlapping communication and computation in the CPU. We have a paper published at ICS’08 (http://ics08.hpclab.ceid.upatras.gr) where we use a simulator to show that, with some tricks, the total execution time is reduced. Of course, these are simulation results, so any resemblance to reality is purely coincidental. Now we would like to use actual hardware to test our hypothesis.
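Just to make the access-count argument concrete, here is a toy back-of-the-envelope model (my own accounting, not a figure from the paper) of one plausible reading: a cudaMemcpy from pageable memory touches main memory three times per byte, versus once for a direct mapping.

```python
# Toy accounting of main-memory accesses per byte moved to the GPU.
# This is a back-of-the-envelope model, not a measurement.

def staged_dma_accesses(n_bytes):
    """cudaMemcpy from pageable memory: the CPU reads the source buffer,
    writes it into a pinned staging buffer, and the DMA engine then reads
    the staging buffer -> 3 main-memory accesses per byte."""
    return 3 * n_bytes

def direct_mapping_accesses(n_bytes):
    """Direct mapping: the CPU reads the source buffer once and writes
    straight over PCIe to the device -> 1 main-memory access per byte."""
    return 1 * n_bytes

n = 1 << 20
print(staged_dma_accesses(n) // direct_mapping_accesses(n))  # -> 3
```

Whether the real ratio is 2x or 3x depends on whether the source buffer is already pinned, but either way the staging traffic is pure overhead.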

(2) Programmability: double-buffering is painful. You usually have to modify your code in really ugly ways to implement double-buffering (which I assume is how you overlap communication and computation). In my opinion, it would be nicer to just port a sequential kernel to CUDA without restructuring the algorithm around double-buffering. Double-buffering is also system-dependent: you have to tune the buffer size so that computation and communication times match, and with a different memory controller, a different memory hierarchy, etc. the optimal buffer size changes. The DMA interface offered by CUDA only allows GPU kernels to take parameters by value (in other words, it does not allow by-reference parameter passing). Because of this limitation, if your GPU kernel accesses scattered data you have to do a marshaling step prior to transferring the data (which, by the way, harms performance). Another problem with the lack of by-reference parameter passing is that you cannot play tricks with pointers to speed up your algorithm (for instance, keeping an array of pointers to frequently accessed elements). Again, there is a more detailed explanation in our ICS paper.
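The kind of restructuring I mean looks roughly like this plain-C sketch, where `fake_transfer` and `fake_kernel` are hypothetical stand-ins for `cudaMemcpyAsync` and a kernel launch (in real CUDA you would additionally need pinned host memory and separate streams for the overlap to actually happen):

```c
#include <string.h>

#define CHUNK   256
#define NCHUNKS 4

/* Hypothetical stand-in for cudaMemcpyAsync (here just a blocking copy). */
static void fake_transfer(float *dst, const float *src, int n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Hypothetical stand-in for the GPU kernel: scale each element by 2. */
static void fake_kernel(float *buf, int n) {
    for (int i = 0; i < n; i++) buf[i] *= 2.0f;
}

float out[NCHUNKS * CHUNK];

/* The loop can no longer just walk the input: it has to be rewritten
 * around two buffers, a pipeline-priming step, and a pointer swap. */
void process(const float *in, int nchunks) {
    float ping[CHUNK], pong[CHUNK];
    float *cur = ping, *next = pong;

    fake_transfer(cur, in, CHUNK);            /* prime the pipeline */
    for (int c = 0; c < nchunks; c++) {
        if (c + 1 < nchunks)                  /* transfer chunk c+1...   */
            fake_transfer(next, in + (c + 1) * CHUNK, CHUNK);
        fake_kernel(cur, CHUNK);              /* ...while computing chunk c */
        memcpy(out + c * CHUNK, cur, CHUNK * sizeof(float));
        float *t = cur; cur = next; next = t; /* swap ping/pong buffers */
    }
}
```

Note that CHUNK is exactly the system-dependent knob I complained about: pick it too small and you pay per-transfer overhead, too large and the overlap disappears.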

Anyway, the actual reason I would like this kind of support is to check whether avoiding DMA transfers (cudaMemcpy) through memory mapping is a good idea or not. My current guess is that it will be beneficial, but I would prefer to have experimental data to support that opinion ;-).

Best,

Isaac