Why is copying data from an mmap() memory area so slow?

While developing a PCIe driver on the TX2, I ran into a problem when copying data from an mmap() memory area to normal memory. (The goal is to copy data quickly from kernel space to user space.)
Kernel code:

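unsigned long size = vma->vm_end - vma->vm_start;
/* Treat the mapping as I/O memory: no expansion, not included in core dumps */
vma->vm_flags |= (VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
/* Map the pages uncached, so CPU reads of this area bypass the cache */
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
if (remap_pfn_range(vma, vma->vm_start, vmalloc_to_pfn(gReadBuffer),
                    size, vma->vm_page_prot)) {
    printk(KERN_ERR "remap_pfn_range failed\n");
    return -EAGAIN; /* assumes this runs inside the driver's mmap handler */
}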

User code:

/* Map the driver buffer, then copy it into an ordinary user-space buffer */
mmap_data = (unsigned char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
memcpy(gReadData, mmap_data, BUF_SIZE);

This copy is very slow, about 110 MB/s. copy_to_user() does the same work at 670 MB/s. If I copy the same amount of data between two normal memory buffers, the speed is 11 GB/s.
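For anyone who wants to reproduce figures like these, here is a minimal user-space sketch of how such a throughput number could be measured. The helper name copy_throughput_mb_s and its parameters are hypothetical (they do not come from the original post), and 1 MB is taken as 10^6 bytes. The empty inline-asm barrier assumes GCC or Clang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the original post): copies `size` bytes
 * from `src` to `dst` `rounds` times and returns the average throughput
 * in MB/s, where 1 MB = 10^6 bytes. */
double copy_throughput_mb_s(void *dst, const void *src, size_t size, int rounds)
{
    struct timespec t0, t1;

    assert(dst != NULL && src != NULL && size > 0 && rounds > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++) {
        memcpy(dst, src, size);
        /* Compiler barrier so repeated copies are not optimized away */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9; /* guard against a timer too coarse for tiny copies */

    return ((double)size * (double)rounds / 1e6) / secs;
}
```

Passing mmap_data as src gives the figure for the mapped area; passing an ordinary malloc() buffer of the same size gives the baseline to compare against.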

I want to use mmap() and copy data out of that area, but right now the copy is far too slow. How can I speed it up?

Can anyone help me?

Most driver APIs don’t map each buffer as it arrives, because entering the virtual memory protected region and changing the page tables is a very heavy operation. That is, the cost is not in “copying” the data; the cost is in the synchronization.

Instead, driver APIs will first allocate buffers, before any data exchange is made. Then, users are responsible for putting data into buffers (or reading data out of buffers) that already exist, and the driver/user API talks about buffers by “handle” (some ID, typically – a file descriptor per buffer is generally too heavy-handed.)

For examples of how this works, look at the V4L2 driver API and the OSS (/dev/dsp) driver API. These have been around for a long time, are well understood, have several well-documented drivers, and have programming models where users first allocate/request/map buffers, and then start cycling through them in the API.
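The allocate-first, then-cycle model described above can be sketched in plain user-space C. This is only an illustration of the pattern, not the V4L2 or OSS API: the buf_pool type, the pool_* functions, and the sizes are all hypothetical, and malloc() stands in for whatever the driver would really allocate and map. Buffers are created once, and afterwards both sides refer to them only by a small integer handle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_BUFS     4     /* hypothetical pool depth */
#define POOL_BUF_SIZE 4096  /* hypothetical buffer size in bytes */

struct buf_pool {
    unsigned char *mem[POOL_BUFS]; /* allocated once, reused forever */
    size_t len[POOL_BUFS];         /* valid bytes in each buffer */
    int free_q[POOL_BUFS];         /* stack of handles ready to be filled */
    int free_n;
    int done_q[POOL_BUFS];         /* FIFO of handles holding fresh data */
    int done_head, done_n;
};

void pool_destroy(struct buf_pool *p)
{
    for (int i = 0; i < POOL_BUFS; i++) {
        free(p->mem[i]); /* free(NULL) is a no-op */
        p->mem[i] = NULL;
    }
}

/* Setup phase: allocate every buffer before any data is exchanged. */
int pool_init(struct buf_pool *p)
{
    memset(p, 0, sizeof *p);
    for (int i = 0; i < POOL_BUFS; i++) {
        p->mem[i] = malloc(POOL_BUF_SIZE);
        if (p->mem[i] == NULL) {
            pool_destroy(p);
            return -1;
        }
        p->free_q[p->free_n++] = i;
    }
    return 0;
}

/* "Driver" side: fill a free buffer and mark it done. Returns the handle,
 * or -1 if no buffer is free or the data does not fit. */
int pool_fill(struct buf_pool *p, const void *data, size_t n)
{
    if (p->free_n == 0 || n > POOL_BUF_SIZE)
        return -1;
    int h = p->free_q[--p->free_n];
    memcpy(p->mem[h], data, n);
    p->len[h] = n;
    p->done_q[(p->done_head + p->done_n) % POOL_BUFS] = h;
    p->done_n++;
    return h;
}

/* "User" side: take the oldest filled buffer. Returns its handle, or -1. */
int pool_dequeue(struct buf_pool *p, size_t *len)
{
    if (p->done_n == 0)
        return -1;
    int h = p->done_q[p->done_head];
    p->done_head = (p->done_head + 1) % POOL_BUFS;
    p->done_n--;
    *len = p->len[h];
    return h;
}

/* "User" side: hand a processed buffer back so it can be filled again. */
void pool_requeue(struct buf_pool *p, int h)
{
    assert(h >= 0 && h < POOL_BUFS && p->free_n < POOL_BUFS);
    p->free_q[p->free_n++] = h;
}
```

In steady state the loop is just pool_dequeue(), process p->mem[h], pool_requeue(); no allocation or mapping happens per transfer, which is the point being made about where the cost lies.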

@snarky

Thank you for your advice. I looked at the V4L2 sample documentation you mentioned and reimplemented my code. I now use dma_mmap_coherent(), and the copy is faster: 670 MB/s, the same as copy_to_user(). I think the non-cached mapping was the bottleneck. I still need something faster, because my PCIe device produces data at 1.4 GB/s. Would dma_map_single() give better performance?

To all:
I tried kmalloc(), dma_map_single(), and copy_to_user() for my transfer, and my application now gets 3.5 GB/s. That solves my problem, so I suggest using the functions I mentioned above.
Thanks, everyone!