Does CUDA support a broadcast function? The new NVIDIA 790i chipset has built-in ....

As I understand it, the new NVIDIA 790i chipset supports broadcasting. So on a multi-GPU platform, copying the same data from the host to the device(s) should be possible with a single memcopy instead of one memcopy per GPU. Does CUDA support that?

And the second question is:
What is the most efficient way to synchronize memory blocks between several GPUs?

For example, take a 3-GPU platform running a particle simulation where each GPU calculates one third of the particles. The first GPU works on the first third, the second GPU on the second third, and the third GPU on the last third. After each simulation frame, they need to synchronize their memory blocks (because every particle depends on all the others). By definition (if nothing has changed in this beta2 version), CUDA requires that each device be accessed from a different host thread. That means the host thread that communicates with, e.g., the first device cannot communicate with the other two, and so on. This leads to the conclusion that after every frame, once the devices have sent their partial memory blocks to their own host threads, the host threads need to synchronize over a shared memory block and then copy the proper thirds of the block to their devices.
How can that be optimized?
Is it possible for the host threads to update that data while kernels are running?
If kernels must be stopped, does that mean they must be loaded again when they have to be started for the next frame?
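
To make the per-frame exchange described above concrete, here is a minimal host-side sketch in plain C++. It is not CUDA code: `FakeDevice`, `Barrier`, and `run_simulation` are hypothetical names, the "kernel" is a toy update rule, and the device copies are ordinary vector copies so that it runs without a GPU. It only illustrates the pattern of one host thread per device, a shared host buffer, and two barriers per frame.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier for C++11 (std::barrier would also work in C++20).
class Barrier {
public:
    explicit Barrier(std::size_t count) : threshold_(count) {}
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned gen = generation_;
        if (++waiting_ == threshold_) {  // last thread to arrive releases the rest
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t threshold_;
    std::size_t waiting_ = 0;
    unsigned generation_ = 0;
};

// Hypothetical stand-in for one GPU: it holds a full copy of all particles
// but only updates its own slice [begin, end).
struct FakeDevice {
    std::vector<double> particles;
    std::size_t begin, end;

    // Stand-in for a kernel launch: every particle depends on all the others.
    void step() {
        double sum = 0.0;
        for (double p : particles) sum += p;
        const double mean = sum / particles.size();
        for (std::size_t i = begin; i < end; ++i)
            particles[i] = 0.5 * particles[i] + mean;
    }
};

// Runs `frames` frames on `devices` simulated GPUs, one host thread per device.
std::vector<double> run_simulation(std::size_t n, std::size_t devices, int frames) {
    assert(devices > 0);
    std::vector<double> initial(n);
    for (std::size_t i = 0; i < n; ++i) initial[i] = static_cast<double>(i);
    std::vector<double> shared = initial;  // host-side staging buffer

    Barrier barrier(devices);
    std::vector<std::thread> workers;
    for (std::size_t d = 0; d < devices; ++d) {
        workers.emplace_back([&, d] {
            FakeDevice dev{initial, n * d / devices, n * (d + 1) / devices};
            for (int f = 0; f < frames; ++f) {
                dev.step();
                // "Device -> host": publish only this device's slice.
                for (std::size_t i = dev.begin; i < dev.end; ++i)
                    shared[i] = dev.particles[i];
                barrier.wait();          // all slices are now in `shared`
                dev.particles = shared;  // "host -> device": refresh the full set
                barrier.wait();          // nobody overwrites `shared` too early
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return shared;
}
```

Calling `run_simulation(9, 3, 4)` gives the same particles as `run_simulation(9, 1, 4)`, which is the property the exchange has to preserve. In real CUDA code each slice copy would presumably be a `cudaMemcpy` between device memory and the shared host buffer; the second barrier is there so that a fast thread cannot start publishing the next frame while a slower one is still reading the current one.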

Can someone help?

These are very good questions… and that chipset “broadcast” ability is something I had not heard of either… could be VERY useful for multi-GPU!

A followup question, maybe even more rare but clearly useful… how about device-to-device NON-PCIE bus transfers? I’m specifically thinking of cases like the 9800GX2 where there’s a direct GPU to GPU pipe right on the card. (Which could even lead to yet more questions about data transfer via SLI connections, even if SLI is disabled for video, those SLI connectors might still be able to communicate anyway, no CPU or PCIE bus needed at all).

Yes, all of that is already supported by the hardware.

The 790i chipset already has those technologies, called PW Shortcut and Broadcast, implemented in hardware, but I’m not sure why they are not supported by CUDA.

This link has info on that…SLI-78932.shtml

Until that comes to CUDA, I must implement the transfers via the CPU. Can someone tell me where optimizations could be made? Can a kernel that is loaded on the device but stopped be started again without reloading, and if so, how? Can the CPU transfer data to the device while a kernel is running, and how would such an operation be synchronized?

Please help

What I understood is that SLI has low bandwidth, so it is not useful for exchanging data between GPUs (it only transfers some pixels from the framebuffer for SLI).

If you read the specs at the posted link, you will see:

"The PW Shortcut implementation creates a direct path between the GPUs, via the northbridge chipset. The technology is essential for the performance of the SLI system, but at the same time, it offers additional functionality by taking off the workload from the system CPU. The classical SLI system will transfer data from the first GPU to the CPU via the northbridge chipset to the second GPU.

The Broadcast technology allows the system processor to send information to all GPUs of a system with a single packet, just like a radio antenna. All the delivered performance will have a major drawback: excessive heat. The estimations claim that the chipset will need professional cooling, that will make the MSI CircuPipe technology a joke."

So your comment doesn’t stand

I am talking about an SLI bridge.

Yes, but the question here is about PW shortcut and Broadcast

I reacted to the post of SPWorley…

It’s ok Riedijk.

I finally found the official paper from NVIDIA,…_PWShort_TB.pdf

Yes, all of that already exists and is implemented in the NVIDIA 790i chipset. A GPU can now initiate communication with another GPU via the northbridge DMA controller; until now, only the CPU could do that. With this improvement, the transfer speed between GPUs is doubled. Also, if the CPU needs to send the same data to, for example, 4 GPUs, it can now do so by broadcasting the data in a single transfer (4 times faster).

Question for the NVIDIA guys:

When will it be implemented in CUDA? Situations like the example in the first post, where synchronization between GPUs is required, could be handled more efficiently with this feature, replacing the double transfer GPU0->host->GPU1 with a single GPU0->GPU1 transfer.

This is a common request. GPU to GPU data transfers are planned for CUDA 2.1 (but no promises!).

Finally someone talks, thanks!

Just knowing it will be supported (sooner or later) makes my day …