If I remember correctly, the CARMA kit has a x4 PCIe 2.0 link, right? That would at least explain a throughput drop, but one could argue that the latency should still be the same or similar.
I was able to cross-check the latency numbers against the bandwidth. For PCIe Gen 1 x4, the maximum bandwidth you can achieve is about 500 MBytes per second in one direction, so the latency numbers look consistent with the bandwidth figure.
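For reference, here is a quick back-of-the-envelope check of the transfer times that the ~500 MB/s figure quoted above would predict. The transfer sizes are just illustrative examples, not numbers from my experiment, and this ignores the fixed per-call launch latency:

```c
/* Rough sanity check: expected copy time = bytes / bandwidth.
   The ~500 MB/s unidirectional bandwidth is the figure quoted above;
   the transfer sizes are arbitrary examples. */
#include <stdio.h>

int main(void)
{
    const double bandwidth = 500e6;                      /* bytes per second */
    const size_t sizes[] = { 4096, 1 << 20, 16 << 20 };  /* 4 KiB, 1 MiB, 16 MiB */

    for (int i = 0; i < 3; ++i) {
        double expected_ms = (double)sizes[i] / bandwidth * 1e3;
        printf("%10zu bytes -> ~%.3f ms (bandwidth-only estimate)\n",
               sizes[i], expected_ms);
    }
    return 0;
}
```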
Another problem I observe is that the time for a cudaMemcpy D2H (device to host) is always higher than H2D (host to device). The variance, if I run the experiment say 1000 times, is also larger for D2H compared to H2D.
I have observed this not only on the CARMA board but also on a normal x86 workstation. Has anybody seen something similar?
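In case it helps reproduce this, here is a minimal sketch of how I would time the two directions with CUDA events. The buffer size and iteration count are arbitrary choices (not the exact setup from my runs), and the host buffer is deliberately pageable:

```c
/* Sketch: compare H2D vs. D2H cudaMemcpy timing over many iterations.
   Buffer size and iteration count are arbitrary; host memory is pageable. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_ITER 1000
#define NBYTES (1 << 20)   /* 1 MiB per transfer */

static void time_copies(void *dst, const void *src, cudaMemcpyKind kind,
                        const char *label)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < N_ITER; ++i) {
        cudaEventRecord(start, 0);
        cudaMemcpy(dst, src, NBYTES, kind);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        sum    += ms;
        sum_sq += (double)ms * ms;
    }

    double mean = sum / N_ITER;
    double var  = sum_sq / N_ITER - mean * mean;
    printf("%s: mean %.3f ms, stddev %.3f ms over %d runs\n",
           label, mean, sqrt(var), N_ITER);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main(void)
{
    void *h_buf = malloc(NBYTES);   /* pageable host memory */
    void *d_buf = NULL;
    cudaMalloc(&d_buf, NBYTES);

    time_copies(d_buf, h_buf, cudaMemcpyHostToDevice, "H2D");
    time_copies(h_buf, d_buf, cudaMemcpyDeviceToHost, "D2H");

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```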
IIRC, pinned memory in CUDA is not functional on CARMA. You should be able to confirm this by running the deviceQuery app and checking the output for “Support host page-locked memory mapping.” Alternatively, you can call cudaGetDeviceProperties() and check the value of “canMapHostMemory” in the cudaDeviceProp struct. mlockall() appears to work just fine though, so perhaps there is some limitation in the CUDA runtime or GPU driver on ARM.
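A minimal sketch of the cudaGetDeviceProperties() check, assuming device 0 is the GPU in question:

```c
/* Sketch: query device 0 and print whether the driver reports support for
   mapping host page-locked memory (the canMapHostMemory property). */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Device 0: %s\n", prop.name);
    printf("canMapHostMemory: %s\n", prop.canMapHostMemory ? "yes" : "no");
    return 0;
}
```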