I am currently evaluating CUDA for a low-latency, real-time signal processing application. As a first step, I measured the time to transfer small blocks of memory from the host to the device and vice versa (the system has two Xeon E5440 processors and a Quadro FX 3700 (G92) connected via PCIe 2.0 x16). Interestingly, the transfer time for blocks of up to 1 KB is constant (about 10 µs), no matter how large the block actually is. Does anybody know whether this is a hardware issue, e.g. because the PCIe bus might always transfer at least 1 KB? If it's a software issue, is there a way to configure the minimum size of a block to be transferred, e.g. by using the Driver API instead of C for CUDA? Thank you very much in advance for your help.
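A minimal sketch of this kind of measurement, for illustration only. It is not the original benchmark: the helper names `CHECK` and `avg_copy_us`, the iteration count, the 64 KB upper bound, and the use of pinned host memory (so that the blocking `cudaMemcpy` returns only once the copy has completed) are all assumptions.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Average host-side wall-clock time, in microseconds, of one blocking
// cudaMemcpy of `bytes` bytes (hypothetical helper).
static double avg_copy_us(void* dst, const void* src, size_t bytes,
                          cudaMemcpyKind kind, int iters) {
    CHECK(cudaMemcpy(dst, src, bytes, kind));  // warm-up, not timed
    CHECK(cudaDeviceSynchronize());

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpy(dst, src, bytes, kind));
    }
    CHECK(cudaDeviceSynchronize());
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iters;
}

int main() {
    const size_t max_bytes = 64 * 1024;  // assumed upper bound for the sweep
    const int iters = 1000;              // assumed repetitions per size

    void* host = nullptr;
    void* dev = nullptr;
    CHECK(cudaMallocHost(&host, max_bytes));  // pinned host buffer
    CHECK(cudaMalloc(&dev, max_bytes));       // device buffer

    std::printf("%10s %12s %12s\n", "bytes", "H2D [us]", "D2H [us]");
    for (size_t bytes = 4; bytes <= max_bytes; bytes *= 2) {
        double h2d = avg_copy_us(dev, host, bytes, cudaMemcpyHostToDevice, iters);
        double d2h = avg_copy_us(host, dev, bytes, cudaMemcpyDeviceToHost, iters);
        std::printf("%10zu %12.2f %12.2f\n", bytes, h2d, d2h);
    }

    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return 0;
}
```

Built with something like `nvcc -O2 -std=c++11 memcpy_latency.cu` (the file name is arbitrary), this prints one row per block size. Averaging over many iterations is a design choice to smooth out timer jitter; for a real-time application the per-call worst case may matter more than the mean.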