How does copying back and forth to the GPU work? (Any NVIDIA engineers in the audience?)

Hi,

I’m having a hard time understanding how to transfer data optimally between host and device memory.

For example:

  • If I do an enqueueMemWrite with a malloc’ed buffer, what exactly happens? (And why doesn’t it return immediately?)
  • What if the source is a buffer allocated with MEM_ALLOC_HOST_PTR? (And again, why doesn’t it return immediately?)
  • When I allocate such a buffer with MEM_ALLOC_HOST_PTR, what happens, and what memory is used exactly? Why does the profiler report a DeviceToHost transfer?

When I wrote a device driver for a custom-built device, a data transfer worked like this:

  • Userspace would just give the kernel driver a userspace pointer to a buffer (a normal malloc’d buffer, nothing special about it)
  • The kernel driver would then use get_user_pages to temporarily lock the pages in physical memory
  • Construct a scatter-gather list
  • Send that list to the device
  • The device would then DMA the data to its local memory
  • When done, an interrupt occurs and we unlock the pages with page_cache_release
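
To make the "construct a scatter-gather list" step concrete, here is a minimal user-space sketch in C. It is not kernel code and not how any particular driver does it: `struct sg_entry`, `build_sg_list`, and the 4096-byte page size are all assumptions made up for illustration, and plain virtual addresses stand in for the physical/bus addresses of the pinned pages that a real driver would hand to the device.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; real code would query the system. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical scatter-gather entry: one chunk that lies within a single page. */
struct sg_entry {
    uintptr_t addr; /* start of the chunk (a bus address in a real driver) */
    size_t    len;  /* chunk length in bytes */
};

/*
 * Split the range [addr, addr + len) at page boundaries.
 * Returns the number of entries written to out, or -1 if the range
 * needs more than max_entries entries.
 */
int build_sg_list(uintptr_t addr, size_t len,
                  struct sg_entry *out, int max_entries)
{
    int n = 0;

    while (len > 0) {
        /* Bytes left between addr and the end of its page. */
        size_t chunk = SKETCH_PAGE_SIZE - (size_t)(addr % SKETCH_PAGE_SIZE);
        if (chunk > len)
            chunk = len;
        if (n == max_entries)
            return -1;

        out[n].addr = addr;
        out[n].len  = chunk;
        n++;

        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```

Calling `build_sg_list` on a buffer that starts 100 bytes into a page and is 10000 bytes long gives three entries: the tail of the first page, one full page, and a partial last page.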

So I’m having a hard time understanding why a device as advanced as an NVIDIA GPU can’t use a similar scheme, and why “pre-pinned” memory should be used for DMA.
Can someone enlighten me on what I’m missing?

OK, I got the answer about the requirement for pinned memory in another thread.

But I’m still unclear about why a call to enqueueMemWrite using a pinned buffer doesn’t return immediately.
For 32 MB, the call takes 35 ms without a pinned buffer and 10 ms with one. That’s faster, but not quite immediate (and yes, the blocking argument is set to FALSE). The time taken is also proportional to the size of the buffer.
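
For reference, the timings above can be turned into effective throughput with a small helper. This is only a back-of-the-envelope sketch: `throughput_gb_per_s` is a hypothetical name, and it assumes "32 MB" means 32 × 1024 × 1024 bytes and that 1 GB/s means 10^9 bytes per second.

```c
/*
 * Effective throughput in GB/s (1 GB = 1e9 bytes, an assumption of this sketch)
 * for a transfer of `bytes` bytes that took `milliseconds` ms.
 */
double throughput_gb_per_s(double bytes, double milliseconds)
{
    double seconds = milliseconds * 1e-3;
    return bytes / seconds / 1e9;
}
```

With `bytes = 32.0 * 1024 * 1024`, the 10 ms pinned case comes out at roughly 3.4 GB/s and the 35 ms non-pinned case at roughly 0.96 GB/s. Those look like data-movement rates rather than a fixed call overhead, which would fit the observation that the time scales with buffer size, but that reading is an inference from the numbers, not something confirmed here.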

Were you able to resolve or figure out why the non-blocking enqueue calls were taking so long?

Also, can you point me to the other forum answer about pinned/non-pinned you referred to?

Thanks.

-Noah
