Issue with cudaHostAlloc on Xiaomi pad

I am using CUDA on a Xiaomi pad. Code that operates on host (CPU) memory allocated with cudaHostAlloc (with the cudaHostAllocMapped flag) is much slower than the same code operating on memory allocated with malloc. On a GTX650, however, the two times are nearly the same.
Results:

Operation: 3x3 convolution filter
Buffer size: 2240x1080

Xiaomi pad:
malloc: 25ms
cudaHostAlloc (cudaHostAllocMapped flag): 1293ms
GTX650:
malloc: 0.55ms
cudaHostAlloc (cudaHostAllocMapped flag): 0.55ms

Does anyone know why the two times differ so much on the Xiaomi pad?
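
For anyone who wants to reproduce this, a minimal sketch of this kind of comparison might look like the code below. It is not the exact code measured above. It assumes the filter runs on the CPU, and the 3x3 box filter, the helper names (box3x3, time_ms), the 8-bit single-channel buffer, and the timing method are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical 3x3 box filter executed on the CPU.
// Border pixels are skipped to keep the sketch short.
static void box3x3(const unsigned char* src, unsigned char* dst, int w, int h) {
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = (unsigned char)(sum / 9);
        }
    }
}

// Wall-clock time of one filter pass, in milliseconds.
static double time_ms(const unsigned char* src, unsigned char* dst, int w, int h) {
    auto t0 = std::chrono::steady_clock::now();
    box3x3(src, dst, w, h);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int w = 2240, h = 1080;
    const size_t n = (size_t)w * h;  // assumed: one byte per pixel

    // Request support for mapped pinned host allocations.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pageable buffers from malloc.
    unsigned char* ms = (unsigned char*)malloc(n);
    unsigned char* md = (unsigned char*)malloc(n);
    if (!ms || !md) return 1;

    // Pinned, device-mapped buffers from cudaHostAlloc.
    unsigned char* ps = nullptr;
    unsigned char* pd = nullptr;
    if (cudaHostAlloc((void**)&ps, n, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void**)&pd, n, cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }

    // Touch every buffer once so first-access page faults are not timed.
    memset(ms, 1, n); memset(md, 0, n);
    memset(ps, 1, n); memset(pd, 0, n);

    printf("malloc:                %.2f ms\n", time_ms(ms, md, w, h));
    printf("cudaHostAlloc(mapped): %.2f ms\n", time_ms(ps, pd, w, h));

    cudaFreeHost(ps); cudaFreeHost(pd);
    free(ms); free(md);
    return 0;
}
```

Built with nvcc, this prints one timing per allocation method, so the two numbers can be compared on each device.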