Fast processing of large amounts of pinned memory

yuriy.natarov · August 28, 2017, 3:15pm

I want to use a large amount of host RAM in my computation.

How can I efficiently organize a pipeline that reads from pinned memory and processes ~400-byte blocks on the GPU? That is, how do I fetch the data in chunks quickly and minimize the time spent waiting for the next piece of data? Maybe with some async copy operations or manual cache management?

And how much memory can be allocated through cudaHostAlloc? For some reason I can allocate only about 7 GB even though more memory is available on the host.

njuffa · August 28, 2017, 6:38pm

The basic approach is to use CUDA streams and asynchronous copies, possibly in conjunction with double-buffering. However, this is unlikely to be efficient if you are moving the data in 400 byte chunks. PCIe uses packetized transport which means high overhead for small transfers. Full throughput typically means using blocks of >= 8 MByte.
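To make the double-buffering idea concrete, here is a minimal sketch, not a tuned implementation. It assumes the data can be consumed sequentially in large chunks, which may not match your access pattern. The names `process` and `pipeline_demo` are hypothetical, and the kernel is only a placeholder that adds 1 to each element.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Placeholder for the real work: adds 1 to every element.
__global__ void process(const int *in, int *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// Pushes `total` ints through the GPU in pieces of `chunk` ints, alternating
// between two streams and two sets of device buffers. Returns the number of
// wrong results (0 means everything matched).
size_t pipeline_demo(size_t total, size_t chunk)
{
    int *h_in, *h_out;
    CHECK(cudaHostAlloc((void **)&h_in, total * sizeof(int), cudaHostAllocDefault));
    CHECK(cudaHostAlloc((void **)&h_out, total * sizeof(int), cudaHostAllocDefault));
    for (size_t i = 0; i < total; ++i) h_in[i] = (int)(i % 1000);

    cudaStream_t stream[2];
    int *d_in[2], *d_out[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaStreamCreate(&stream[b]));
        CHECK(cudaMalloc((void **)&d_in[b], chunk * sizeof(int)));
        CHECK(cudaMalloc((void **)&d_out[b], chunk * sizeof(int)));
    }

    int b = 0;
    for (size_t off = 0; off < total; off += chunk, b ^= 1) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        // Work queued in one stream runs in order, so reusing buffer b two
        // iterations later is safe; work in the other stream may overlap.
        CHECK(cudaMemcpyAsync(d_in[b], h_in + off, n * sizeof(int),
                              cudaMemcpyHostToDevice, stream[b]));
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream[b]>>>(d_in[b], d_out[b], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out + off, d_out[b], n * sizeof(int),
                              cudaMemcpyDeviceToHost, stream[b]));
    }
    for (int s = 0; s < 2; ++s) CHECK(cudaStreamSynchronize(stream[s]));

    size_t bad = 0;
    for (size_t i = 0; i < total; ++i)
        if (h_out[i] != h_in[i] + 1) ++bad;

    for (int s = 0; s < 2; ++s) {
        CHECK(cudaFree(d_in[s]));
        CHECK(cudaFree(d_out[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return bad;
}
```

Calling `pipeline_demo(8u << 20, 2u << 20)` from `main` moves 32 MB in 8 MB pieces. The host buffers must be pinned for `cudaMemcpyAsync` to overlap with kernel execution.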

The amount of memory allocatable through cudaHostAlloc() is a function of the underlying operating system calls. cudaHostAlloc() is basically just a thin wrapper around those. Since pinned memory is allocated in physically contiguous chunks, allocation can be affected by fragmentation in the operating system allocator (meaning more pinnable memory may be available, just not in the size you are currently requesting).
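If the limit you hit is related to the size of a single request, one experiment is to pin several smaller buffers instead of one huge one and see how far you get. The following is only a sketch under that assumption; whether it raises the total depends on the operating system. The helper names `pin_in_chunks` and `release_chunks` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Tries to pin `total_bytes` of host memory as separate pieces of at most
// `chunk_bytes` each. Stops at the first failed request and returns the
// number of bytes successfully pinned; the pointers are appended to `chunks`.
size_t pin_in_chunks(size_t total_bytes, size_t chunk_bytes,
                     std::vector<void *> &chunks)
{
    if (chunk_bytes == 0) return 0;
    size_t pinned = 0;
    while (pinned < total_bytes) {
        size_t left = total_bytes - pinned;
        size_t want = left < chunk_bytes ? left : chunk_bytes;
        void *p = nullptr;
        cudaError_t err = cudaHostAlloc(&p, want, cudaHostAllocDefault);
        if (err != cudaSuccess) {
            fprintf(stderr, "stopped after %zu bytes: %s\n",
                    pinned, cudaGetErrorString(err));
            cudaGetLastError();  // clear the error so later calls are not affected
            break;
        }
        chunks.push_back(p);
        pinned += want;
    }
    return pinned;
}

// Unpins and frees everything obtained from pin_in_chunks().
void release_chunks(std::vector<void *> &chunks)
{
    for (void *p : chunks) cudaFreeHost(p);
    chunks.clear();
}
```

Comparing the return value for different `chunk_bytes` settings shows whether the request size matters on your system.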

Pinning a large-ish percentage of the system memory is usually not a good idea, as operating systems are designed with memory paging in mind.

Note that the performance advantage of pinned host memory vs regular pageable memory has diminished since CPU designers started supporting quad-channel DDR4, which delivers >= 60 GB/sec memory throughput. So the first thing you might want to do is check whether the use of pinned memory is definitely necessary.
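A quick way to run that check is to time the same host-to-device copy from a pageable and from a pinned buffer. Below is a rough sketch using CUDA events. The function name `h2d_gbps`, the buffer size, and the repetition count are my own assumptions, and the numbers will vary by system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Copies a `bytes`-sized host buffer to the device `reps` times and returns
// the measured throughput in GB/s (1 GB = 1e9 bytes). The host buffer is
// pinned if `pinned` is true, otherwise it comes from malloc().
float h2d_gbps(size_t bytes, bool pinned, int reps)
{
    void *h = nullptr, *d = nullptr;
    if (pinned) {
        CHECK(cudaHostAlloc(&h, bytes, cudaHostAllocDefault));
    } else {
        h = malloc(bytes);
        if (!h) { fprintf(stderr, "malloc failed\n"); exit(EXIT_FAILURE); }
    }
    memset(h, 1, bytes);  // touch every page before timing
    CHECK(cudaMalloc(&d, bytes));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));  // warm-up

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    for (int r = 0; r < reps; ++r)
        CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    if (pinned) CHECK(cudaFreeHost(h)); else free(h);

    // bytes per millisecond divided by 1e6 gives GB/s
    return (float)((double)bytes * reps / ((double)ms * 1e6));
}
```

Printing `h2d_gbps(bytes, false, 10)` next to `h2d_gbps(bytes, true, 10)` for a few transfer sizes shows how large the gap is on a particular machine.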

yuriy.natarov · August 29, 2017, 11:08am

Thanks for the reply,

I use pinned memory mainly because it can be mapped and accessed directly from GPU code, so it is fast and no additional copy operations are required. It is like adding some slower RAM to the device. Is it okay to use it this way?
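For reference, this is roughly the mapped (zero-copy) setup I mean, reduced to a minimal sketch. The names `sum_records` and `zero_copy_demo`, the test data, and the per-record sum are placeholders, not my real workload. One thread block handles one randomly chosen 400-byte record, with consecutive threads reading consecutive integers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

constexpr int REC_INTS = 100;  // one 400-byte record
constexpr int THREADS  = 128;  // next power of two above REC_INTS

// One thread block per requested record: threads 0..99 each read one int
// directly from mapped host memory, then the block sums them in shared memory.
__global__ void sum_records(const int *data, const unsigned *ids, long long *sums)
{
    __shared__ long long part[THREADS];
    int t = threadIdx.x;
    size_t base = (size_t)ids[blockIdx.x] * REC_INTS;
    part[t] = (t < REC_INTS) ? data[base + t] : 0;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (t < s) part[t] += part[t + s];
        __syncthreads();
    }
    if (t == 0) sums[blockIdx.x] = part[0];
}

// Sums `n_ids` pseudo-randomly chosen records out of `n_records` and returns
// the number of sums that differ from the value computed on the host.
size_t zero_copy_demo(size_t n_records, int n_ids)
{
    int dev = 0;
    cudaDeviceProp prop;
    CHECK(cudaGetDevice(&dev));
    CHECK(cudaGetDeviceProperties(&prop, dev));
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "device cannot map host memory\n");
        exit(EXIT_FAILURE);
    }

    int *h_data, *d_view;
    CHECK(cudaHostAlloc((void **)&h_data, n_records * REC_INTS * sizeof(int),
                        cudaHostAllocMapped));
    for (size_t r = 0; r < n_records; ++r)
        for (int j = 0; j < REC_INTS; ++j)
            h_data[r * REC_INTS + j] = (int)(r % 1000) + j;
    // Device-side address of the same pinned host buffer; no copy is made.
    CHECK(cudaHostGetDevicePointer((void **)&d_view, h_data, 0));

    std::vector<unsigned> ids(n_ids);
    unsigned x = 12345u;  // simple LCG standing in for the real access pattern
    for (int i = 0; i < n_ids; ++i) {
        x = x * 1664525u + 1013904223u;
        ids[i] = (unsigned)((x >> 8) % n_records);
    }

    unsigned *d_ids;
    long long *d_sums;
    CHECK(cudaMalloc((void **)&d_ids, n_ids * sizeof(unsigned)));
    CHECK(cudaMalloc((void **)&d_sums, n_ids * sizeof(long long)));
    CHECK(cudaMemcpy(d_ids, ids.data(), n_ids * sizeof(unsigned), cudaMemcpyHostToDevice));

    sum_records<<<n_ids, THREADS>>>(d_view, d_ids, d_sums);
    CHECK(cudaGetLastError());

    std::vector<long long> sums(n_ids);
    CHECK(cudaMemcpy(sums.data(), d_sums, n_ids * sizeof(long long), cudaMemcpyDeviceToHost));

    size_t bad = 0;
    for (int i = 0; i < n_ids; ++i) {
        long long expect = 100LL * (ids[i] % 1000) + 4950;  // sum of (r % 1000) + j for j = 0..99
        if (sums[i] != expect) ++bad;
    }
    CHECK(cudaFree(d_ids));
    CHECK(cudaFree(d_sums));
    CHECK(cudaFreeHost(h_data));
    return bad;
}
```

Every read of `data` in the kernel goes to host memory, so each record access costs a small transfer over the bus.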

Regarding memory copying: 400 B is actually 100 32-bit integers. So, would it be more efficient to pre-copy the data from host memory (to some memory faster than global memory?) instead of accessing the integers directly? The problem is that I don't know which 400 B will be needed in the next step, so the reads from memory are essentially random. I can't read several blocks in advance to access them later.
