CARMA Kit latency too high

I am running some initial benchmarks on the CARMA kit and found that the latency is higher than I expected. The numbers look like this:

Elements Transferred : 1 ( float: 4 bytes)
H2D: 114.940798 us
H2D Pinned: 125.577605 us
D2H: 170.166397 us
D2H Pinned: 125.494397 us

Elements Transferred : 4096 ( float: 4 bytes)
H2D: 186.67 us
H2D Pinned: 181.747 us
D2H: 260.294 us
D2H Pinned: 158.3555 us

Here H2D is a host-to-device transfer and D2H is a device-to-host transfer. All timings are in microseconds (us), and every element is a 4-byte float.

Can anybody confirm these numbers and explain why the latency is so high?

If I remember correctly, the CARMA kit has four PCIe 2.0 lanes, right? That would at least explain a drop in throughput, but one could argue that the latency should still be the same or similar.

Can you post some detailed hardware specs?

Hi, the CARMA kit’s PCIe link is Gen 1, see

Anyway, I’d advise posting on the dedicated CARMA forums here:

I was able to cross-check the latency numbers against the bandwidth. For PCIe Gen 1 x4, the maximum bandwidth you can achieve is 500 MB/s in one direction, so the latency numbers look consistent with that bandwidth figure.
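
One way to make this cross-check concrete is a two-point fit through the 1-element and 4096-element timings posted at the top of the thread. This is only a rough sketch: it assumes the transfer time is a fixed overhead plus bytes divided by bandwidth, and the helper name `fit` is made up for illustration, not part of any CUDA tool.

```python
# Two-point fit of t = t0 + n / B for each transfer type.
# Timings (microseconds) are the ones posted at the top of this thread.
# ASSUMPTION: transfer time is linear in payload size; two points per
# transfer type is a very coarse fit.
SMALL_BYTES = 1 * 4       # 1 float
LARGE_BYTES = 4096 * 4    # 4096 floats

timings_us = {
    "H2D":        (114.940798, 186.67),
    "H2D Pinned": (125.577605, 181.747),
    "D2H":        (170.166397, 260.294),
    "D2H Pinned": (125.494397, 158.3555),
}


def fit(t_small_us, t_large_us):
    """Return (fixed overhead in us, incremental bandwidth in MB/s)."""
    # 1 byte/us equals 10^6 bytes/s, i.e. 1 MB/s.
    bytes_per_us = (LARGE_BYTES - SMALL_BYTES) / (t_large_us - t_small_us)
    overhead_us = t_small_us - SMALL_BYTES / bytes_per_us
    return overhead_us, bytes_per_us


results = {name: fit(*t) for name, t in timings_us.items()}

for name, (overhead_us, mb_per_s) in results.items():
    print(f"{name:11s} overhead ~{overhead_us:6.1f} us, ~{mb_per_s:6.1f} MB/s")
```

Under that model, the pinned D2H case comes out close to the 500 MB/s figure, while every transfer type carries a fixed cost of more than 100 us. That fixed cost, not the link bandwidth, dominates the small transfers.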

The other problem I observe is that the cudaMemcpy time for D2H (device to host) is always higher than for H2D (host to device). The variance, if I run the experiment, say, 1000 times, is also larger for D2H than for H2D.

I have observed this not only on the CARMA board but also on an ordinary x86 workstation. Has anybody seen something similar?


We are aware of the high latency issue you pointed out.
We’ll look into this and hope to address it in a future release.

@bharatkumarsharma: That latency looks right to me.

Here are some old plots that illustrate the relationship between PCIe width, transfers per second, payload size and throughput.

As you can see a single PCIe 1.1 lane is a pretty narrow pipe!

PCIe 2.0 x8 ::: E8600 + P45 + GTX470 + Win7x64:

PCIe 1.1 x1 ::: Atom D525 + NM10 + ION2 + Win7x64:

(I think “pinned” was actually pinned + write-combined)
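
The relationship between link width, payload size and throughput can also be sketched with a simple model. It is not a reproduction of the plots. It assumes raw per-lane rates of 250 MB/s for PCIe 1.x and 500 MB/s for PCIe 2.0 (per direction, after 8b/10b encoding). It also assumes a hypothetical 125 us fixed cost per transfer, borrowed from the 4-byte pinned timings reported earlier in this thread. The function name `effective_mb_s` is made up for illustration.

```python
# Simple model: time = fixed_overhead + payload / peak_bandwidth.
# ASSUMPTIONS (not measurements from the plots):
#   - raw per-lane, per-direction rates: PCIe 1.x 250 MB/s, PCIe 2.0 500 MB/s
#   - a hypothetical fixed cost of 125 us per transfer
#   - no protocol overhead beyond 8b/10b encoding
PER_LANE_MB_S = {"1.x": 250.0, "2.0": 500.0}
OVERHEAD_US = 125.0


def effective_mb_s(payload_bytes, gen, lanes, overhead_us=OVERHEAD_US):
    """Effective throughput in MB/s for one transfer of payload_bytes."""
    peak = PER_LANE_MB_S[gen] * lanes          # MB/s is the same as bytes/us
    time_us = overhead_us + payload_bytes / peak
    return payload_bytes / time_us


for payload in (4, 4096, 65536, 1 << 20, 16 << 20):
    x1 = effective_mb_s(payload, "1.x", 1)
    x8 = effective_mb_s(payload, "2.0", 8)
    print(f"{payload:>9d} B   PCIe 1.x x1: {x1:8.2f} MB/s   "
          f"PCIe 2.0 x8: {x8:8.2f} MB/s")
```

In this model, small payloads get almost the same throughput on either link because the fixed cost dominates. The width of the pipe only starts to matter once the payload is large enough for the wire time to exceed the overhead.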

IIRC, I don’t think pinned memory in CUDA is functional on CARMA. You should be able to confirm this by running the deviceQuery app and checking the output for “Support host page-locked memory mapping”. Alternatively, you can call cudaGetDeviceProperties() and check the value of “canMapHostMemory” in the cudaDeviceProp struct. mlockall() appears to work just fine, though, so perhaps there is some limitation in the CUDA runtime or GPU driver on ARM.