Hi all,
In a multi-GPU application, or on an S1070, is it possible to copy a vector or matrix directly from one GPU to another GPU (i.e. without going through the CPU)?
Thanks.
Unfortunately, the answer is no :(
Really?! :(((
Suppose there are 4 GPUs working in parallel, and at each time step they need to exchange their computed vector variables… what should we do?! Join the CPU threads, copy all the data the GPUs need into the CPU's RAM, copy it back from the CPU to the GPUs, and then create the CPU threads all over again…?!
Anyway, thanks for the reply…
It’s even worse than you think: when you allocate pinned memory for fast transfers, it is only fast for the one GPU whose context you made the allocation in. So you get fast transfers for GPU1->host, but then slow transfers for host->GPU2, host->GPU3, and host->GPU4.
You’ll need one thread per GPU anyway. The only way is to have each thread copy its region of GPU memory to the host, sync, and then copy back what the other threads copied in.
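Something like this, roughly (a sketch of that staging pattern with one pthread per GPU; the buffer names, sizes, and the pageable staging buffer are my assumptions, not HOOMD's actual code):

#include <cuda_runtime.h>
#include <pthread.h>
#include <stdlib.h>

#define NGPU 4
#define N    (1 << 20)                  /* floats produced per GPU per step */

static float *h_staging;                /* NGPU * N floats of host memory   */
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int gpu = (int)(size_t)arg;
    cudaSetDevice(gpu);                 /* bind this thread to its GPU */

    float *d_local, *d_all;
    cudaMalloc((void **)&d_local, N * sizeof(float));
    cudaMalloc((void **)&d_all, NGPU * N * sizeof(float));
    /* ... run this GPU's kernels for the time step, filling d_local ... */

    /* 1. each thread copies its own slice up to the shared host buffer */
    cudaMemcpy(h_staging + gpu * N, d_local, N * sizeof(float),
               cudaMemcpyDeviceToHost);

    /* 2. wait until every GPU's slice has landed in host memory */
    pthread_barrier_wait(&barrier);

    /* 3. copy the gathered buffer (everyone's slices) back down */
    cudaMemcpy(d_all, h_staging, NGPU * N * sizeof(float),
               cudaMemcpyHostToDevice);

    cudaFree(d_local);
    cudaFree(d_all);
    return NULL;
}

int main(void)
{
    /* pageable on purpose: pinned memory would only be fast for the one
       GPU whose context allocated it (see above) */
    h_staging = (float *)malloc(NGPU * N * sizeof(float));
    pthread_barrier_init(&barrier, NULL, NGPU);

    pthread_t t[NGPU];
    for (int i = 0; i < NGPU; ++i)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < NGPU; ++i)
        pthread_join(t[i], NULL);

    pthread_barrier_destroy(&barrier);
    free(h_staging);
    return 0;
}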
It’s interesting that you say “each time step they need to exchange their computed vector variables” because this is exactly what I need to do in HOOMD. It is a very slow operation. You can only hope that your processing time on the GPU is so long that the time to transfer data is insignificant.
If you are curious: HOOMD is about 1.4x faster when run on 2 GPUs, and it isn’t worth running on more than two.
I’m crushed!! :(
When I run on 2 or 4 GPUs, the elapsed time just for joining the threads is way too much! I don’t know what to do about that!
Is your OS Windows or Linux? Mine is Windows, and I’m using the WaitForMultipleObjects call to join the threads, as is done in the NVIDIA CUDA SDK code samples… are you using it too?
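The shape of it is something like this (a simplified sketch along the lines of the SDK's multi-GPU samples; the worker body and names are placeholders, not my actual code):

#include <windows.h>
#include <cuda_runtime.h>

#define NGPU 4

static DWORD WINAPI gpuWorker(LPVOID param)
{
    int gpu = (int)(INT_PTR)param;
    cudaSetDevice(gpu);                 /* bind this thread to its GPU      */
    /* ... launch this GPU's kernels for the current time step ...          */
    cudaThreadSynchronize();            /* make sure the GPU work finished  */
    return 0;
}

int main(void)
{
    HANDLE threads[NGPU];
    for (int i = 0; i < NGPU; ++i)
        threads[i] = CreateThread(NULL, 0, gpuWorker,
                                  (LPVOID)(INT_PTR)i, 0, NULL);

    /* block until every GPU worker thread has finished */
    WaitForMultipleObjects(NGPU, threads, TRUE, INFINITE);

    for (int i = 0; i < NGPU; ++i)
        CloseHandle(threads[i]);
    return 0;
}

And that whole create/wait cycle happens every time step, which is where the time goes.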
Thanks.
I run on both Windows and Linux, and I’m using Boost threads with condition variables for synchronization. I don’t know what Boost uses internally, but the cost is tiny: a few microseconds at most.
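The sync itself is just a little condition-variable barrier, something like this (a simplified sketch; HOOMD's real code is more involved):

#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>

// Reusable barrier built on a boost mutex + condition variable.
class Barrier
{
public:
    explicit Barrier(unsigned int count)
        : m_count(count), m_waiting(0), m_generation(0) {}

    void wait()
    {
        boost::mutex::scoped_lock lock(m_mutex);
        unsigned int gen = m_generation;

        if (++m_waiting == m_count)
        {
            // last thread in: release everyone and start a new generation
            m_generation++;
            m_waiting = 0;
            m_cond.notify_all();
        }
        else
        {
            // wait until the last thread bumps the generation counter
            while (gen == m_generation)
                m_cond.wait(lock);
        }
    }

private:
    boost::mutex m_mutex;
    boost::condition m_cond;
    unsigned int m_count, m_waiting, m_generation;
};

Each GPU worker thread just calls wait() on a shared Barrier when it finishes its part of the time step; the threads stay alive across steps, so nothing is created or torn down each step.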
Microseconds? Wow!!!
I didn’t know about Boost… is there any template or example code you could share that uses this library with multiple GPUs?
Fixed in 2.2: pinned memory can be shared across contexts. Thanks, driver team!
Could you please explain this more? :blink:
That is very good news!
Thanks!
I’ve been waiting for this feature for months! I’m creating a CUcontext in one thread and allocating page-locked memory there, but I need to transfer that memory to a device bound to another thread. The ability to decouple CUcontexts from pinned memory regions is a critical capability for me.
Has this fix prompted a change in the context management API? I don’t really care for the stack-based management.
ETA of the beta?
I believe this means that when you allocate host memory in page-locked mode (i.e. using cudaMallocHost), the D<->H transfers will be fast for all cards in the system - each card will be able to issue a DMA to the same host memory buffer. Until now, each page-locked buffer was tied to a single card, so sharing data meant either copying within host memory or using pageable memory (which, under the hood, also results in internal copying).
No, although cuCtxPopCurrent(&ctx) now returns the context that was just popped instead of the context that is now active (which is to say, now its behavior makes sense).
You have to manually set pinned regions to be shared across specific contexts because the CUDA driver is a user-mode driver that uses a lot of thread-local storage, but it’s very straightforward.
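Roughly, the flow on the driver API side looks something like this (sketch only, error checking omitted; the exact allocation flag and how you scope a region to specific contexts may look different in the final beta):

#include <cuda.h>

/* Assumes cuInit() has been called and ctx0/ctx1 were created with
   cuCtxCreate() on two different devices. */
void shared_pinned_example(CUcontext ctx0, CUcontext ctx1, size_t bytes)
{
    void *h_pinned;
    CUdeviceptr d_buf0, d_buf1;

    /* allocate the pinned buffer while ctx0 is current, asking for it
       to be portable (usable for DMA from other contexts) */
    cuCtxPushCurrent(ctx0);
    cuMemHostAlloc(&h_pinned, bytes, CU_MEMHOSTALLOC_PORTABLE);
    cuMemAlloc(&d_buf0, bytes);
    /* ... pretend GPU 0 computed something into d_buf0 ... */
    cuMemcpyDtoH(h_pinned, d_buf0, bytes);   /* fast: DMA from GPU 0 */
    cuCtxPopCurrent(NULL);

    /* the very same host buffer is pinned as far as ctx1 is concerned */
    cuCtxPushCurrent(ctx1);
    cuMemAlloc(&d_buf1, bytes);
    cuMemcpyHtoD(d_buf1, h_pinned, bytes);   /* fast: DMA into GPU 1 */
    cuCtxPopCurrent(NULL);
}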
Not a problem. But will this be doable from the runtime API, or is it one of those driver API perks?