I’m having a hard time understanding how to transfer data optimally between host and device memory.
For example:
if I do an enqueueMemWrite with a malloc’ed buffer, what exactly happens? (and why doesn’t it return immediately?)
What if the source is a buffer allocated with MEM_ALLOC_HOST_PTR? (and again, why doesn’t it return immediately?)
When I allocate such a buffer with MEM_ALLOC_HOST_PTR, what exactly happens, and what memory is used? Why does the profiler report a DeviceToHost transfer?
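For reference, here is roughly what I mean by the two cases, written with the plain C API names (error checks stripped; `ctx` and `queue` are assumed to be an already-created context and command queue):

```c
#include <stdlib.h>
#include <CL/cl.h>

/* Case 1: ordinary pageable host memory from malloc(). */
static void write_from_malloc(cl_context ctx, cl_command_queue queue, size_t n)
{
    void *src = malloc(n);
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n, NULL, NULL);
    /* blocking_write = CL_FALSE, yet the call still takes time proportional to n */
    clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, n, src, 0, NULL, NULL);
}

/* Case 2: buffer created with CL_MEM_ALLOC_HOST_PTR, filled through a mapping. */
static void write_from_alloc_host_ptr(cl_context ctx, cl_command_queue queue, size_t n)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                n, NULL, NULL);
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, n, 0, NULL, NULL, NULL);
    /* ... fill p with the data to send ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```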
When I wrote a device driver for a custom-built device, whenever we wanted a data transfer (a rough sketch follows after this list):
userspace would just hand the kernel driver a pointer to a buffer (a normal malloc’d buffer, nothing special about it)
The kernel driver would then use get_user_pages to temporarily lock the pages in physical memory
Construct a scatter-gather list
Send that list to the device
The device then DMAs the data to its local memory
When done, an interrupt occurs and we unlock the pages with page_cache_release
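A rough sketch of that flow, for comparison (using current kernel APIs: put_page() has replaced the old page_cache_release() macro, and mydev_program_sg() is just a placeholder for whatever programs the device’s DMA engine):

```c
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Placeholder for the device-specific code that kicks off the DMA. */
static void mydev_program_sg(struct device *dev, struct sg_table *sgt);

/* Host-to-device transfer straight out of a userspace buffer. */
static int mydev_dma_from_user(struct device *dev, void __user *ubuf, size_t len)
{
    unsigned long first  = (unsigned long)ubuf & PAGE_MASK;
    unsigned long offset = (unsigned long)ubuf & ~PAGE_MASK;
    int nr_pages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    struct page **pages;
    struct sg_table sgt;
    int i, pinned, ret;

    pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Temporarily lock the user pages in physical memory. */
    pinned = get_user_pages_fast(first, nr_pages, 0, pages);
    if (pinned < nr_pages) {
        ret = -EFAULT;
        goto release_pages;
    }

    /* Build a scatter-gather list covering those pages ... */
    ret = sg_alloc_table_from_pages(&sgt, pages, pinned, offset, len, GFP_KERNEL);
    if (ret)
        goto release_pages;

    /* ... map it for DMA and hand it to the device. */
    ret = dma_map_sgtable(dev, &sgt, DMA_TO_DEVICE, 0);
    if (ret)
        goto free_table;

    mydev_program_sg(dev, &sgt);
    /* ... wait for the completion interrupt, then tear everything down. */

    dma_unmap_sgtable(dev, &sgt, DMA_TO_DEVICE, 0);
free_table:
    sg_free_table(&sgt);
release_pages:
    for (i = 0; i < pinned; i++)
        put_page(pages[i]);          /* was page_cache_release() */
    kfree(pages);
    return ret;
}
```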
So I’m having a hard time understanding why a device as advanced as an NVIDIA GPU can’t use a similar scheme, and why “pre-pinned” memory has to be used for DMA.
Can someone enlighten me on what I’m missing?
Ok, I got the answer about the requirement for pinned memory in another thread.
But I’m still unclear about why a call to enqueueMemWrite using a pinned buffer isn’t immediate.
For 32 MB, it takes 35 ms without a pinned buffer and 10 ms with one. That’s faster, but hardly immediate (and yes, the blocking arg is set to FALSE), and the time taken is proportional to the size of the buffer.
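One way to split the cost of the API call itself from the cost of the actual transfer is to combine a wall-clock timer around the enqueue with OpenCL profiling events, something like the sketch below (assuming `queue` was created with CL_QUEUE_PROFILING_ENABLE and that `dev_buf`, `host_ptr` and `n` already exist):

```c
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

static void time_write(cl_command_queue queue, cl_mem dev_buf,
                       const void *host_ptr, size_t n)
{
    cl_event ev;
    struct timespec t0, t1;
    cl_ulong start = 0, end = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, n, host_ptr,
                         0, NULL, &ev);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* time spent inside the API call */

    clWaitForEvents(1, &ev);

    /* Device-timeline duration of the transfer itself. */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    printf("enqueue call: %.3f ms, transfer: %.3f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6,
           (end - start) / 1e6);
    clReleaseEvent(ev);
}
```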