Issue with CUDA pinned memory on Tegra K1 (Xiaomi Pad)

Hi all,

I ran into a problem when using CUDA pinned memory (allocated with cudaHostAlloc or cudaMallocHost) on the Xiaomi Pad, which uses a Tegra K1. The performance of copying data from pinned memory to pageable memory is very poor!
I ran some tests; the detailed steps are described below:
Test1 (a minimal sketch of the timing code follows this list):

  1. Use malloc to allocate pageable memory.
  2. Use cudaHostAlloc or cudaMallocHost to allocate pinned memory.
  3. Copy 4704000 bytes from the pageable memory to the pinned memory: about 7 ms.
  4. Copy the same amount of data from the pinned memory back to the pageable memory: about 95 ms.

pageable memory → pinned memory: about 7 ms
pinned memory → pageable memory: about 95 ms
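For reference, here is a minimal sketch of the kind of timing code behind Test1. The std::chrono timing and the lack of error checking are simplifications of mine, not necessarily what my real application does; the buffer size matches the tests above:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <cuda_runtime.h>

// Time a single host-to-host memcpy in milliseconds.
static double time_memcpy(void *dst, const void *src, size_t bytes) {
    auto t0 = std::chrono::high_resolution_clock::now();
    memcpy(dst, src, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 4704000;  // same buffer size as in the tests above

    // Step 1: pageable host memory via plain malloc.
    void *pageable = malloc(bytes);

    // Step 2: pinned host memory via cudaMallocHost
    // (cudaHostAlloc with cudaHostAllocDefault behaves the same here).
    void *pinned = nullptr;
    cudaMallocHost(&pinned, bytes);

    memset(pageable, 1, bytes);
    memset(pinned, 2, bytes);

    // Step 3: pageable -> pinned (~7 ms on the Xiaomi Pad).
    printf("pageable -> pinned: %.2f ms\n", time_memcpy(pinned, pageable, bytes));

    // Step 4: pinned -> pageable (~95 ms on the Xiaomi Pad).
    printf("pinned -> pageable: %.2f ms\n", time_memcpy(pageable, pinned, bytes));

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```

Note that the slow direction is the one that *reads* from the pinned buffer, while writing into it is fast.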

Test2: copy a 4704000-byte buffer between two pageable buffers on the Xiaomi Pad (Tegra K1):
pageable memory → pageable memory: about 7 ms
pageable memory ← pageable memory: about 7 ms

Test3: the same application on a GTX 650:
pageable memory → pinned memory: about 1.9 ms
pinned memory → pageable memory: about 1.9 ms
pageable memory → pageable memory: about 1.9 ms
pageable memory ← pageable memory: about 1.9 ms

My question is: why is the performance of pinned memory so different between the Tegra K1 (Xiaomi Pad) and the GTX 650? Since pageable memory and pinned memory both reside on the host, they should have similar performance for memcpy or cudaMemcpy(HostToHost), just as they do on the GTX 650.

I'm not sure whether this is an issue with the Tegra K1 in general or only with the Xiaomi Pad. I would appreciate it if someone could tell me what I should do; I'm in a hurry to solve this!

BTW, if anyone else has run into this problem, please let me know.

Thanks

I’ve encountered a similar issue on the Jetson TK1. I asked a similar question on StackOverflow and later answered my own question. The short answer is that data allocated with cudaHostAlloc() is not cached in the CPU caches, so host-side accesses to it are very slow. Here’s the link:
http://stackoverflow.com/questions/27972491/cpu-memory-access-latency-of-data-allocated-with-malloc-vs-cudahostalloc-on
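You can see the cache effect directly with a plain CPU read loop, without memcpy involved. The following is an illustrative sketch of mine, not the exact code from the linked answer; on the Tegra K1 the read over the cudaHostAlloc buffer should come out far slower than the read over the malloc buffer:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <cuda_runtime.h>

// Sum every byte through a volatile pointer so the reads cannot be elided.
static double time_read(const volatile unsigned char *p, size_t bytes) {
    unsigned long sum = 0;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < bytes; ++i) sum += p[i];
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("(checksum %lu) ", sum);  // use the result so it is observable
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 4704000;

    // malloc'd memory is CPU-cached as usual.
    unsigned char *cached = (unsigned char *)malloc(bytes);

    // cudaHostAlloc'd memory is not CPU-cached on the Tegra K1.
    unsigned char *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

    memset(cached, 1, bytes);
    memset(pinned, 1, bytes);

    printf("read malloc buffer:        %.2f ms\n", time_read(cached, bytes));
    printf("read cudaHostAlloc buffer: %.2f ms\n", time_read(pinned, bytes));

    cudaFreeHost(pinned);
    free(cached);
    return 0;
}
```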