pinned memory with multiple GPUs

koby · April 10, 2008, 4:24pm

Hello all!

I want to build a system with 2 G9x GPUs. Will it be possible to use page-locked memory, so the copies to the two GPUs will be performed asynchronously?

In my program, there will be two different buffers (allocated from page-locked memory) each of which will be copied to a specific GPU asynchronously. Is this going to work?

Thanks in advance!

Chirality · April 11, 2008, 9:33pm

As long as each GPU is controlled by its own thread, yes.

To me, CUDA programming for multiple GPUs is more about learning POSIX threads (or whatever the Windows equivalent is) than anything else.

koby · April 12, 2008, 12:49pm

Thanks a lot for the answer!

So, does each thread have to allocate its own pinned memory in order to perform asynchronous copies, or doesn’t it matter?

For example, in my program the allocation of the buffers is performed during the initialization (and before the threads have been created) and then each thread will use its own buffer to do its work on the corresponding GPU. Will the copies of each thread be asynchronous?

I’m asking because a different topic in this forum mentions that each thread has to do the allocation on its own in order for the copies to be asynchronous. Is that true?

Thanks in advance!!

seb · April 12, 2008, 2:58pm

From what I have seen and read here, you have to allocate pinned memory in the thread in which you want to use it. You can use it in any other thread, but you will not get a performance benefit if you do.
If you want the fast transfer, I think you have to allocate the memory in the thread that the CUDA context is bound to.
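
A minimal sketch of that pattern, assuming one host thread per GPU and using the CUDA runtime API. The names `gpu_worker` and `copy_to_all_gpus` are hypothetical and only for illustration, and `std::thread` stands in for POSIX threads. Each thread first selects its device, then allocates its own pinned buffer, and only then issues the asynchronous copy:

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical worker: runs on its own host thread and drives one GPU.
// Writes 1 to *ok on success, 0 if any CUDA call fails.
static void gpu_worker(int device, size_t bytes, int *ok) {
    void *h_buf = nullptr;          // pinned host buffer, allocated by THIS thread
    void *d_buf = nullptr;          // device buffer on the selected GPU
    cudaStream_t stream = nullptr;

    // Select the GPU first, so the allocations below belong to this
    // thread's device/context.
    bool good = cudaSetDevice(device) == cudaSuccess
             && cudaMallocHost(&h_buf, bytes) == cudaSuccess
             && cudaMalloc(&d_buf, bytes) == cudaSuccess
             && cudaStreamCreate(&stream) == cudaSuccess;

    if (good) {
        memset(h_buf, device, bytes);   // dummy payload
        // Enqueue the host-to-device copy on this thread's stream,
        // then wait for it to finish.
        good = cudaMemcpyAsync(d_buf, h_buf, bytes,
                               cudaMemcpyHostToDevice, stream) == cudaSuccess
            && cudaStreamSynchronize(stream) == cudaSuccess;
    }

    if (stream) cudaStreamDestroy(stream);
    if (d_buf)  cudaFree(d_buf);
    if (h_buf)  cudaFreeHost(h_buf);
    *ok = good ? 1 : 0;
}

// Starts one worker thread per visible GPU and returns how many failed.
// Returns 0 when no GPU is visible (nothing to do).
int copy_to_all_gpus(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 0;

    std::vector<int> ok(count, 0);
    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d)
        threads.emplace_back(gpu_worker, d, bytes, &ok[d]);
    for (auto &t : threads) t.join();

    int failures = 0;
    for (int d = 0; d < count; ++d) failures += ok[d] ? 0 : 1;
    return failures;
}
```

Calling `copy_to_all_gpus(1 << 20)` from `main` would push 1 MiB to every GPU in parallel. The sketch does not measure whether a buffer allocated before the threads were created would be slower; it only shows the allocate-in-the-using-thread arrangement described above.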

koby · April 12, 2008, 4:25pm

This is actually what I have understood from reading similar posts. However, this behavior seems weird to me.

From what I have learned so far by programming in a single-GPU environment, the page-locked memory address range is tracked by the cudaMemcpy-family functions, so they can accelerate copies to/from the device and perform them asynchronously.
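
For reference, the single-GPU case can be sketched roughly as follows. This is only an illustration with a hypothetical function name (`async_copy_single_gpu`), assuming the default device is used: the buffer comes from `cudaMallocHost`, the copy is enqueued with `cudaMemcpyAsync` on a stream, and the host waits on that stream later.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical helper: copies `bytes` from a pinned host buffer to the
// default GPU asynchronously. Returns 0 on success, -1 if there is no
// usable GPU or any CUDA call fails.
int async_copy_single_gpu(size_t bytes) {
    void *h_pinned = nullptr;       // page-locked host memory
    void *d_buf = nullptr;          // destination on the device
    cudaStream_t stream = nullptr;

    bool good = cudaMallocHost(&h_pinned, bytes) == cudaSuccess
             && cudaMalloc(&d_buf, bytes) == cudaSuccess
             && cudaStreamCreate(&stream) == cudaSuccess;

    if (good) {
        memset(h_pinned, 0, bytes); // dummy payload

        // Enqueue the copy. With a page-locked source, the call can
        // return before the transfer has completed.
        good = cudaMemcpyAsync(d_buf, h_pinned, bytes,
                               cudaMemcpyHostToDevice, stream) == cudaSuccess;

        // ... independent host work could run here ...

        // Block until the copy on this stream has finished.
        good = good && cudaStreamSynchronize(stream) == cudaSuccess;
    }

    if (stream)   cudaStreamDestroy(stream);
    if (d_buf)    cudaFree(d_buf);
    if (h_pinned) cudaFreeHost(h_pinned);
    return good ? 0 : -1;
}
```

The host buffer must not be modified or freed between the `cudaMemcpyAsync` call and the `cudaStreamSynchronize` call, since the transfer may still be in flight.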

Why does the page-locked memory have to be allocated by each thread in order for the memory copies to be asynchronous, assuming that each copy will be performed by a different thread?