[SOLVED] cudaMemcpy down to 100 MB/s

Hello everybody!

I have a bunch of matrices that I copy to my GTX 1050 Ti for some processing.
Each transfer is about 200 MB of pageable memory.

I’m seeing significant differences in the transfer rate.

The first transfers reach up to 2 GB/s, while the last ones drop below 100 MB/s, at which point the performance hit becomes non-negligible. There are 12 matrices.

For pinned memory, the transfer rate is pretty much fixed at 12 GB/s for all matrices.

Could these large differences be due to the CPU struggling to find contiguous memory space when the RAM gets around 80% full?

Thanks and have a nice day!
Norman

Have you checked whether the pageable memory may actually be paged in/out, i.e. whether there is some amount of swapping? With various operating systems, swapping data to disk and back starts before the physical system memory is 100% occupied, and may well happen when utilization is around 80%.
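One way to check for swapping around a transfer is sketched below. It assumes a Linux host, where /proc/vmstat exposes the cumulative counters pswpin and pswpout (pages swapped in and out). The thread does not say which operating system is in use, and the helper names are hypothetical.

```python
def parse_swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Pages swapped in and out between two snapshots of the counters."""
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("pswpin", "pswpout")}


def read_vmstat(path="/proc/vmstat"):
    """Read the raw counter text (assumption: Linux, default procfs path)."""
    with open(path) as f:
        return f.read()
```

Take one snapshot with parse_swap_counters(read_vmstat()) before the slow copy and one after it, then pass both to swap_activity. Non-zero deltas suggest that pages were moved to or from disk while the copy was running.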

Pinning lots of system memory, therefore making it unavailable to the virtual memory allocator, may cause swapping to start at even lower system memory utilization.

Try reducing your system memory usage, or increasing the size of the system memory. Rule of thumb (recommendation) for high-performance GPU-accelerated systems: system memory size ~= 4x total GPU memory size.
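As a quick illustration of that rule of thumb, here is a minimal sketch. The function name is hypothetical, and the 4 GB figure assumes a single GTX 1050 Ti with its standard 4 GB of device memory.

```python
def recommended_system_memory_gb(gpu_memory_gb, factor=4):
    """Rule of thumb: system RAM ~= factor x total GPU memory (all values in GB)."""
    return factor * sum(gpu_memory_gb)


# Assumption: one GPU with 4 GB of device memory (GTX 1050 Ti).
print(recommended_system_memory_gb([4]))
```

This is only a sizing heuristic. A workload that keeps large datasets resident in host memory can need considerably more than the figure it gives.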

Thanks for your input!

That was indeed the problem: going to 128 GB of RAM solved it, even when the memory was not pinned. I could also see the system paging in and out as the RAM filled up. It’s quite surprising how slow it can get, though!

Anyways, all solved, thanks :)

Unless your disk storage consists of high-performance NVMe SSDs, throughput is likely on the order of 100s of MB/sec, and that’s for uni-directional traffic reading physically contiguous blocks.
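To put the rates from this thread side by side, here is a rough back-of-the-envelope sketch. It takes 1 GB as 1000 MB and treats each quoted rate as constant for the whole transfer, both of which are simplifying assumptions.

```python
def transfer_time_s(size_mb, rate_mb_per_s):
    """Seconds needed to move size_mb megabytes at rate_mb_per_s."""
    return size_mb / rate_mb_per_s


MATRIX_MB = 200      # per-transfer size quoted in the thread
NUM_MATRICES = 12    # number of matrices quoted in the thread

# Rates quoted in the thread, converted to MB/s (assuming 1 GB = 1000 MB).
rates_mb_per_s = {
    "pageable while swapping (100 MB/s)": 100,
    "pageable, best case (2 GB/s)": 2000,
    "pinned (12 GB/s)": 12000,
}

for label, rate in rates_mb_per_s.items():
    per_matrix = transfer_time_s(MATRIX_MB, rate)
    total = per_matrix * NUM_MATRICES
    print(f"{label}: {per_matrix:.3f} s per matrix, {total:.2f} s for all {NUM_MATRICES}")
```

At 100 MB/s a single 200 MB matrix takes 2 s, against roughly 17 ms at 12 GB/s, which is why a disk-bound transfer feels so slow in comparison.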