Fast processing of large amounts of pinned memory

I want to use a large amount of RAM in my computation.

How can I effectively organize a pipeline that reads from pinned memory and processes ~400-byte blocks on the GPU? That is, fetch data in fast chunks while minimizing the time spent waiting for the next piece of bytes. Maybe some async copy operations or manual cache management would help.

And how much memory can be allocated through cudaHostAlloc()? For some reason I can allocate only about 7 GB even though more memory is available on the host.

The basic approach is to use CUDA streams and asynchronous copies, possibly in conjunction with double-buffering. However, this is unlikely to be efficient if you move the data in 400-byte chunks. PCIe uses packetized transport, which means high overhead for small transfers; full throughput typically requires blocks of >= 8 MByte.
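A minimal sketch of the streams-plus-double-buffering pattern, assuming data is staged in large chunks rather than 400-byte pieces. `process_chunk` is a hypothetical placeholder kernel, the chunk count and sizes are made up for illustration, and error checking is abbreviated:

```cuda
#include <cuda_runtime.h>

#define CHUNK_BYTES (8u << 20)   // 8 MB per transfer for good PCIe efficiency
#define NUM_CHUNKS  16

__global__ void process_chunk(const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { /* ... work on data[i] ... */ }
}

int main() {
    const int n = CHUNK_BYTES / sizeof(int);

    int *h_buf;                          // pinned host staging area
    cudaHostAlloc(&h_buf, (size_t)NUM_CHUNKS * CHUNK_BYTES, cudaHostAllocDefault);

    int *d_buf[2];                       // two device buffers: copy into one
    cudaMalloc(&d_buf[0], CHUNK_BYTES);  // while the kernel works on the other
    cudaMalloc(&d_buf[1], CHUNK_BYTES);

    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int c = 0; c < NUM_CHUNKS; ++c) {
        int s = c & 1;                   // alternate buffers and streams
        // Async copy from pinned memory overlaps with the kernel
        // still running in the other stream.
        cudaMemcpyAsync(d_buf[s], h_buf + (size_t)c * n, CHUNK_BYTES,
                        cudaMemcpyHostToDevice, stream[s]);
        process_chunk<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    cudaFreeHost(h_buf);
    return 0;
}
```

The async copy only actually overlaps with kernel execution because the host buffer is pinned; with pageable memory cudaMemcpyAsync falls back to a staged, effectively synchronous copy.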

The amount of memory allocatable through cudaHostAlloc() is a function of the underlying operating-system calls; cudaHostAlloc() is basically just a thin wrapper around them. Since pinned memory is allocated in physically contiguous chunks, allocation can be affected by fragmentation in the operating system's allocator (meaning more pinnable memory may be available, just not in the contiguous size you are currently requesting).

Pinning a large-ish percentage of the system memory is usually not a good idea, as operating systems are designed with memory paging in mind.

Note that the performance advantage of pinned host memory over regular pageable memory has diminished since CPU designers started supporting quad-channel DDR4, which delivers >= 60 GB/sec of memory throughput. So the first thing you might want to do is check whether the use of pinned memory is actually necessary.

Thanks for the reply,

I use pinned memory mainly because it can be mapped and used directly from GPU code, so it is fast and no additional copy operations are required. It is like adding some slower RAM to the device. Is it okay to use it this way?
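For reference, the mapped ("zero-copy") usage described above looks roughly like the following sketch. The kernel name and sizes are illustrative; the key point is that the device pointer aliases host RAM, so every access the kernel makes travels over PCIe:

```cuda
#include <cuda_runtime.h>

__global__ void sum_mapped(const int *host_data, int n, int *result) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += host_data[i];   // each read crosses the PCIe bus
    *result = acc;
}

int main() {
    int *h_data, *d_alias, *d_result;

    // Allocate pinned memory that is mapped into the device address space.
    cudaHostAlloc(&h_data, 100 * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < 100; ++i) h_data[i] = i;

    // Get the device-side pointer that aliases the same host memory.
    cudaHostGetDevicePointer(&d_alias, h_data, 0);

    cudaMalloc(&d_result, sizeof(int));
    sum_mapped<<<1, 1>>>(d_alias, 100, d_result);
    cudaDeviceSynchronize();

    cudaFree(d_result);
    cudaFreeHost(h_data);
    return 0;
}
```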

Regarding the memory copying: 400 B is actually 100 32-bit integers. So, would it be more efficient to pre-copy the data from host memory (to some memory faster than global memory?) instead of accessing the integers directly? The problem is that I don't know which 400 B block will be needed in the next step, so the reads are quite random; I can't fetch several blocks in advance to access them later.
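One hedged sketch of the "pre-copy to faster memory" idea: if all threads of a block reuse the same 100 integers, the block can stage them from the mapped host pointer into shared memory once, so only the initial pull crosses the bus. The `block_ids` indirection stands in for the runtime-chosen block and is purely hypothetical:

```cuda
__global__ void stage_and_use(const int *host_data,   // mapped pinned pointer
                              const int *block_ids,   // which 400 B block each
                              int num_ints_per_block) // CUDA block should use
{
    __shared__ int local[100];

    int which = block_ids[blockIdx.x];  // chosen at runtime, not known ahead

    // One pass over PCIe: 100 consecutive 4-byte loads, done cooperatively.
    if (threadIdx.x < num_ints_per_block)
        local[threadIdx.x] = host_data[which * num_ints_per_block + threadIdx.x];
    __syncthreads();

    // ... all further reads hit on-chip shared memory instead of the bus ...
}
```

This only pays off when the 100 integers are read more than once within the block; for a single pass, reading the mapped pointer directly (with coalesced accesses) costs about the same.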