I’m having a hard time understanding how to transfer data optimally between host and device memory.
For example:
if I do an enqueueMemWrite with a malloc’ed buffer, what exactly happens? (and why doesn’t it return immediately?)
What if the source is a buffer allocated with MEM_ALLOC_HOST_PTR? (and again, why doesn’t it return immediately?)
When I allocate such a buffer with MEM_ALLOC_HOST_PTR, what exactly happens, and what memory is used? Why does the profiler report a DeviceToHost transfer?
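For reference, here is roughly what I mean by the two cases, written with the plain C API names (error checks stripped; `ctx` and `queue` are assumed to be an already-created context and command queue):

```c
#include <stdlib.h>
#include <CL/cl.h>

/* Case 1: ordinary pageable host memory from malloc(). */
static void write_from_malloc(cl_context ctx, cl_command_queue queue, size_t n)
{
    void *src = malloc(n);
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n, NULL, NULL);
    /* blocking_write = CL_FALSE, yet the call still takes time proportional to n */
    clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, n, src, 0, NULL, NULL);
}

/* Case 2: buffer created with CL_MEM_ALLOC_HOST_PTR, filled through a mapping. */
static void write_from_alloc_host_ptr(cl_context ctx, cl_command_queue queue, size_t n)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                n, NULL, NULL);
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, n, 0, NULL, NULL, NULL);
    /* ... fill p with the data to send ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```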
When I wrote a device driver for a custom-built device, whenever we wanted a data transfer (a rough sketch follows after this list):
userspace would just hand the kernel driver a pointer to a buffer (a normal malloc’d buffer, nothing special about it)
The kernel driver would then use get_user_pages to temporarily lock the pages in physical memory
Construct a scatter-gather list
Send that list to the device
The device then DMAs the data to its local memory
When done, an interrupt occurs and we unlock the pages with page_cache_release
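A rough sketch of that flow, for comparison (using current kernel APIs: put_page() has replaced the old page_cache_release() macro, and mydev_program_sg() is just a placeholder for whatever programs the device’s DMA engine):

```c
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Placeholder for the device-specific code that kicks off the DMA. */
static void mydev_program_sg(struct device *dev, struct sg_table *sgt);

/* Host-to-device transfer straight out of a userspace buffer. */
static int mydev_dma_from_user(struct device *dev, void __user *ubuf, size_t len)
{
    unsigned long first  = (unsigned long)ubuf & PAGE_MASK;
    unsigned long offset = (unsigned long)ubuf & ~PAGE_MASK;
    int nr_pages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    struct page **pages;
    struct sg_table sgt;
    int i, pinned, ret;

    pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Temporarily lock the user pages in physical memory. */
    pinned = get_user_pages_fast(first, nr_pages, 0, pages);
    if (pinned < nr_pages) {
        ret = -EFAULT;
        goto release_pages;
    }

    /* Build a scatter-gather list covering those pages ... */
    ret = sg_alloc_table_from_pages(&sgt, pages, pinned, offset, len, GFP_KERNEL);
    if (ret)
        goto release_pages;

    /* ... map it for DMA and hand it to the device. */
    ret = dma_map_sgtable(dev, &sgt, DMA_TO_DEVICE, 0);
    if (ret)
        goto free_table;

    mydev_program_sg(dev, &sgt);
    /* ... wait for the completion interrupt, then tear everything down. */

    dma_unmap_sgtable(dev, &sgt, DMA_TO_DEVICE, 0);
free_table:
    sg_free_table(&sgt);
release_pages:
    for (i = 0; i < pinned; i++)
        put_page(pages[i]);          /* was page_cache_release() */
    kfree(pages);
    return ret;
}
```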
So I’m having a hard time understanding why a device as advanced as an NVIDIA GPU can’t use a similar scheme, and why “pre-pinned” memory has to be used for DMA.
Can someone enlighten me on what I’m missing?
Ok, I got the answer about the requirement for pinned memory in another thread.
But I’m still unclear about why a call to enqueueMemWrite using a pinned buffer isn’t immediate.
For 32 MB, it takes 35 ms without a pinned buffer and 10 ms with one. That’s faster, but hardly immediate (and yes, the blocking arg is set to FALSE), and the time taken is proportional to the size of the buffer.
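One way to split the cost of the API call itself from the cost of the actual transfer is to combine a wall-clock timer around the enqueue with OpenCL profiling events, something like the sketch below (assuming `queue` was created with CL_QUEUE_PROFILING_ENABLE and that `dev_buf`, `host_ptr` and `n` already exist):

```c
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

static void time_write(cl_command_queue queue, cl_mem dev_buf,
                       const void *host_ptr, size_t n)
{
    cl_event ev;
    struct timespec t0, t1;
    cl_ulong start = 0, end = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, n, host_ptr,
                         0, NULL, &ev);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* time spent inside the API call */

    clWaitForEvents(1, &ev);

    /* Device-timeline duration of the transfer itself. */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    printf("enqueue call: %.3f ms, transfer: %.3f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6,
           (end - start) / 1e6);
    clReleaseEvent(ev);
}
```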