I assume you are referring to transfers between the host system and the GPU, which use the PCIe interconnect, unless you are one of the lucky few who get to use a PowerPC system with NVLink.
PCIe is a packetized transport, which means there is a fixed per-transfer overhead on top of the payload. Small transfers are dominated by that overhead, so effective throughput increases with growing transfer size. Maximum transfer rates (~12 GB/sec per direction for a PCIe gen3 x16 link, using pinned host memory) are typically reached for transfer sizes in the 8 MB to 16 MB range.
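If a rough model helps (a back-of-the-envelope sketch of my own, not anything from the PCIe specification), approximate the time for a transfer of s bytes as

T(s) = T_0 + s / B_max

where T_0 is the fixed per-transfer overhead and B_max is the peak link rate. The effective throughput is then

B_eff(s) = s / T(s) = B_max / (1 + T_0 * B_max / s)

which climbs toward B_max as s grows, and explains why the measured curve flattens out once transfers reach several megabytes.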
Note that while fewer, larger transfers are therefore more efficient overall than many small transfers, batching data into larger transfers can increase the latency seen by any individual piece of data, which may matter in the context of your application.
You don’t need a paper to assess the transfer rates at different transfer sizes; you can simply measure them yourself. CUDA ships with a sample application called bandwidthTest that you could modify to suit your needs. A minimal sketch of such a measurement is shown below.
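In case it is useful as a starting point, here is my own simplified stand-in for bandwidthTest (error checking omitted for brevity). It times host-to-device copies from pinned memory at power-of-two sizes using CUDA events:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t maxBytes = 64 << 20;  // test sizes up to 64 MB
    void *hbuf, *dbuf;
    cudaMallocHost(&hbuf, maxBytes);   // pinned host memory
    cudaMalloc(&dbuf, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // warm-up copy so one-time initialization cost is not measured
    cudaMemcpy(dbuf, hbuf, maxBytes, cudaMemcpyHostToDevice);

    for (size_t bytes = 1 << 10; bytes <= maxBytes; bytes <<= 1) {
        const int reps = 20;           // average over several copies
        cudaEventRecord(start);
        for (int i = 0; i < reps; i++) {
            cudaMemcpy(dbuf, hbuf, bytes, cudaMemcpyHostToDevice);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbPerSec = (double)bytes * reps / (ms * 1e-3) / 1e9;
        printf("%10zu bytes : %7.2f GB/sec\n", bytes, gbPerSec);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dbuf);
    cudaFreeHost(hbuf);
    return 0;
}
```

Swap the direction of the cudaMemcpy calls (and the buffer arguments) to measure device-to-host rates, which can differ slightly from host-to-device rates on some platforms.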