The basic approach is to use CUDA streams and asynchronous copies, possibly in conjunction with double-buffering. However, this is unlikely to be efficient if you are moving the data in 400-byte chunks. PCIe uses a packetized transport, which means high per-transfer overhead for small transfers. Reaching full throughput typically requires transfers of >= 8 MByte.
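As a rough sketch of that approach, the following alternates between two device buffers and two streams so that the copy of one chunk can overlap with processing of the previous one. `process_chunk` is a hypothetical placeholder for your real kernel, and `CHUNK_BYTES` is set to 8 MB per the transfer-size advice above; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define CHUNK_BYTES (8u * 1024u * 1024u)

// hypothetical stand-in for the actual per-chunk processing
__global__ void process_chunk(char *buf, size_t n) { }

int main(void) {
    const int num_chunks = 16;
    char *h_src, *d_buf[2];
    cudaStream_t stream[2];

    // pinned host memory is required for cudaMemcpyAsync to overlap with kernels
    cudaHostAlloc((void **)&h_src, (size_t)num_chunks * CHUNK_BYTES,
                  cudaHostAllocDefault);
    for (int i = 0; i < 2; i++) {
        cudaMalloc((void **)&d_buf[i], CHUNK_BYTES);
        cudaStreamCreate(&stream[i]);
    }

    for (int c = 0; c < num_chunks; c++) {
        int s = c & 1;  // ping-pong between the two buffers/streams
        cudaMemcpyAsync(d_buf[s], h_src + (size_t)c * CHUNK_BYTES, CHUNK_BYTES,
                        cudaMemcpyHostToDevice, stream[s]);
        process_chunk<<<256, 256, 0, stream[s]>>>(d_buf[s], CHUNK_BYTES);
    }
    cudaDeviceSynchronize();
    return 0;
}
```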
The amount of memory allocatable through cudaHostAlloc() is a function of the underlying operating system calls. cudaHostAlloc() is basically just a thin wrapper around those. Since pinned memory is allocated in physically contiguous chunks, allocation can be affected by fragmentation in the operating system allocator (meaning more pinnable memory may be available, just not in the size you are currently requesting).
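Because fragmentation can make one large request fail while smaller pinned allocations would still succeed, one possible workaround is to retry with progressively smaller sizes. A minimal sketch, with an illustrative 4 GB starting target and a 1 MB floor (both arbitrary):

```cuda
#include <cuda_runtime.h>

// try to pin as large a block as the OS will currently provide
void *alloc_pinned_best_effort(size_t *got) {
    void *p = NULL;
    size_t want = 1ull << 32;              // illustrative 4 GB target
    while (want >= (1u << 20)) {           // give up below 1 MB
        if (cudaHostAlloc(&p, want, cudaHostAllocDefault) == cudaSuccess) {
            *got = want;
            return p;
        }
        p = NULL;
        want /= 2;                         // smaller request, easier to satisfy
    }
    *got = 0;
    return NULL;
}
```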
Pinning a large-ish percentage of the system memory is usually not a good idea, as operating systems are designed with memory paging in mind.
Note that the performance advantage of pinned host memory vs regular pageable memory has diminished since CPUs gained support for quad-channel DDR4, which delivers >= 60 GB/sec of memory throughput. So the first thing you might want to do is check whether the use of pinned memory is definitely necessary.
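One quick way to check is to time a large host-to-device copy from a pageable buffer and from a pinned buffer on your actual system. A minimal sketch using CUDA events for timing (64 MB transfer size is an arbitrary choice; error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// time one host-to-device cudaMemcpy, returning milliseconds
static float time_h2d(void *dst, const void *src, size_t n) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst, src, n, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

int main(void) {
    const size_t n = 64u * 1024u * 1024u;
    void *d, *pageable = malloc(n), *pinned;
    cudaMalloc(&d, n);
    cudaHostAlloc(&pinned, n, cudaHostAllocDefault);
    // GB/s = bytes / (ms * 1e6)
    printf("pageable: %.1f GB/s\n", n / (time_h2d(d, pageable, n) * 1e6));
    printf("pinned  : %.1f GB/s\n", n / (time_h2d(d, pinned, n) * 1e6));
    return 0;
}
```

If the two numbers are close on your platform, the extra complexity and pinning cost may not be worth it.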