Hi all,
In a multi-GPU application, or on an S1070, is it possible to copy a vector or matrix directly from one GPU to another GPU (i.e. without going through the CPU)?
Thanks.
Unfortunately, the answer is no :(
Really?! :(((
Suppose there are 4 GPUs working in parallel, and at each time step they need to exchange their computed vector variables… what should we do?! Join the CPU threads, copy all the data the GPUs need into the CPU's RAM, copy it back from the CPU to the GPUs, and then create the CPU threads all over again…?!
Anyway, thanks for the reply…
It’s even worse than you think: when you allocate pinned memory for fast transfers, it is only fast for the one GPU whose context you made the allocation in. So you get fast transfers for GPU1->host, but then slow transfers for host->GPU2, host->GPU3, and host->GPU4.
You’ll need one thread per GPU anyway. The only way is to have each thread copy its region of GPU memory to the host, sync, and then copy back what the other threads copied in.
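Something like this, roughly (a sketch of that staging pattern with one pthread per GPU; the buffer names, sizes, and the pageable staging buffer are my assumptions, not HOOMD's actual code):

#include <cuda_runtime.h>
#include <pthread.h>
#include <stdlib.h>

#define NGPU 4
#define N    (1 << 20)                  /* floats produced per GPU per step */

static float *h_staging;                /* NGPU * N floats of host memory   */
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int gpu = (int)(size_t)arg;
    cudaSetDevice(gpu);                 /* bind this thread to its GPU */

    float *d_local, *d_all;
    cudaMalloc((void **)&d_local, N * sizeof(float));
    cudaMalloc((void **)&d_all, NGPU * N * sizeof(float));
    /* ... run this GPU's kernels for the time step, filling d_local ... */

    /* 1. each thread copies its own slice up to the shared host buffer */
    cudaMemcpy(h_staging + gpu * N, d_local, N * sizeof(float),
               cudaMemcpyDeviceToHost);

    /* 2. wait until every GPU's slice has landed in host memory */
    pthread_barrier_wait(&barrier);

    /* 3. copy the gathered buffer (everyone's slices) back down */
    cudaMemcpy(d_all, h_staging, NGPU * N * sizeof(float),
               cudaMemcpyHostToDevice);

    cudaFree(d_local);
    cudaFree(d_all);
    return NULL;
}

int main(void)
{
    /* pageable on purpose: pinned memory would only be fast for the one
       GPU whose context allocated it (see above) */
    h_staging = (float *)malloc(NGPU * N * sizeof(float));
    pthread_barrier_init(&barrier, NULL, NGPU);

    pthread_t t[NGPU];
    for (int i = 0; i < NGPU; ++i)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < NGPU; ++i)
        pthread_join(t[i], NULL);

    pthread_barrier_destroy(&barrier);
    free(h_staging);
    return 0;
}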
It’s interesting that you say “each time step they need to exchange their computed vector variables” because this is exactly what I need to do in HOOMD. It is a very slow operation. You can only hope that your processing time on the GPU is so long that the time to transfer data is insignificant.
If you are curious: HOOMD is about 1.4x faster when run on 2 GPUs, and it isn’t worth running on more than two.
I’m crushed!! :(
When I run on 2 or 4 GPUs, the elapsed time just for joining the threads is way too much! I don’t know what to do about that!
Is your OS Windows or Linux? Mine is Windows, and I’m using the WaitForMultipleObjects call to join the threads, as is done in the NVIDIA CUDA SDK code samples… are you using it too?
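The shape of it is something like this (a simplified sketch along the lines of the SDK's multi-GPU samples; the worker body and names are placeholders, not my actual code):

#include <windows.h>
#include <cuda_runtime.h>

#define NGPU 4

static DWORD WINAPI gpuWorker(LPVOID param)
{
    int gpu = (int)(INT_PTR)param;
    cudaSetDevice(gpu);                 /* bind this thread to its GPU      */
    /* ... launch this GPU's kernels for the current time step ...          */
    cudaThreadSynchronize();            /* make sure the GPU work finished  */
    return 0;
}

int main(void)
{
    HANDLE threads[NGPU];
    for (int i = 0; i < NGPU; ++i)
        threads[i] = CreateThread(NULL, 0, gpuWorker,
                                  (LPVOID)(INT_PTR)i, 0, NULL);

    /* block until every GPU worker thread has finished */
    WaitForMultipleObjects(NGPU, threads, TRUE, INFINITE);

    for (int i = 0; i < NGPU; ++i)
        CloseHandle(threads[i]);
    return 0;
}

And that whole create/wait cycle happens every time step, which is where the time goes.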
Thanks.
I run on both Windows and Linux, and I’m using Boost threads with condition variables for synchronization. I don’t know what Boost uses internally, but the cost is tiny: a few microseconds at most.
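The sync itself is just a little condition-variable barrier, something like this (a simplified sketch; HOOMD's real code is more involved):

#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>

// Reusable barrier built on a boost mutex + condition variable.
class Barrier
{
public:
    explicit Barrier(unsigned int count)
        : m_count(count), m_waiting(0), m_generation(0) {}

    void wait()
    {
        boost::mutex::scoped_lock lock(m_mutex);
        unsigned int gen = m_generation;

        if (++m_waiting == m_count)
        {
            // last thread in: release everyone and start a new generation
            m_generation++;
            m_waiting = 0;
            m_cond.notify_all();
        }
        else
        {
            // wait until the last thread bumps the generation counter
            while (gen == m_generation)
                m_cond.wait(lock);
        }
    }

private:
    boost::mutex m_mutex;
    boost::condition m_cond;
    unsigned int m_count, m_waiting, m_generation;
};

Each GPU worker thread just calls wait() on a shared Barrier when it finishes its part of the time step; the threads stay alive across steps, so nothing is created or torn down each step.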
Microseconds? Wow!!!
I didn’t know about Boost… is there any template or example code you could share that uses this library with multiple GPUs?
Fixed in 2.2: pinned memory can be shared across contexts. Thanks, driver team!
Could you please explain this more? :blink:
That is very good news!
Thanks!
I’ve been waiting for this feature for months! I’m creating a CUcontext in one thread and allocating page-locked memory there, but I need to transfer that memory to a device bound to another thread. The ability to decouple CUcontexts from pinned memory regions is a critical capability for me.
Has this fix prompted a change in the context management API? I don’t really care for the stack-based management.
ETA of the beta?
I believe this means that when you allocate host memory in page-locked mode (i.e. using cudaMallocHost), the D<->H transfers will be fast for all cards in the system - each card will be able to issue a DMA to the same host memory buffer. Until now, each page-locked buffer was tied to a single card, so sharing data meant either copying within host memory or using pageable memory (which, under the hood, also results in internal copying).
No, although cuCtxPopCurrent(&ctx) now returns the context that was just popped instead of the context that is now active (which is to say, now its behavior makes sense).
You have to manually set pinned regions to be shared across specific contexts because the CUDA driver is a user-mode driver that uses a lot of thread-local storage, but it’s very straightforward.
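Roughly, the flow on the driver API side looks something like this (sketch only, error checking omitted; the exact allocation flag and how you scope a region to specific contexts may look different in the final beta):

#include <cuda.h>

/* Assumes cuInit() has been called and ctx0/ctx1 were created with
   cuCtxCreate() on two different devices. */
void shared_pinned_example(CUcontext ctx0, CUcontext ctx1, size_t bytes)
{
    void *h_pinned;
    CUdeviceptr d_buf0, d_buf1;

    /* allocate the pinned buffer while ctx0 is current, asking for it
       to be portable (usable for DMA from other contexts) */
    cuCtxPushCurrent(ctx0);
    cuMemHostAlloc(&h_pinned, bytes, CU_MEMHOSTALLOC_PORTABLE);
    cuMemAlloc(&d_buf0, bytes);
    /* ... pretend GPU 0 computed something into d_buf0 ... */
    cuMemcpyDtoH(h_pinned, d_buf0, bytes);   /* fast: DMA from GPU 0 */
    cuCtxPopCurrent(NULL);

    /* the very same host buffer is pinned as far as ctx1 is concerned */
    cuCtxPushCurrent(ctx1);
    cuMemAlloc(&d_buf1, bytes);
    cuMemcpyHtoD(d_buf1, h_pinned, bytes);   /* fast: DMA into GPU 1 */
    cuCtxPopCurrent(NULL);
}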
Not a problem. But will this be doable from the runtime API, or is it one of those driver API perks?