GeForce GTX 690 - dual and independent DMA engines?

I’m developing a raycaster (using OpenCL, but that shouldn’t make a difference) and have a GeForce GTX 690 card, with its dual GPUs, in my system. So far I’ve only been using one device, but now I’m looking to split the rendering across both devices so that each device holds the same data and shoots half the rays. My bottleneck currently seems to be host-side code; the DMA transfer of the data from host to device (using PBOs and textures) is fast and can be done asynchronously.

My question is this: Do the two devices work independently of each other? If I initiate the same DMA transfer on both of them from the same host data (from a PBO to a texture), will each carry out its transfer on its own and finish just as fast as a single device would? I would assume so, but I’m not sure whether the devices cooperate somehow.

Thanks!

In CUDA, the two devices act independently, so I assume the same is true in OpenCL.

The underlying GTX 690 hardware has two separate GPUs with separate device memory connected via a PCI-Express switch to the card connector that plugs into the motherboard. As a result, the two devices must share bandwidth when communicating with the host memory. Because the multiplexing is done with a switch, if one GPU is transferring data and the other GPU is not, the first GPU can use 100% of the host<->device bandwidth. But if the two GPUs transfer simultaneously, then each GPU only gets 50% of the available bandwidth. Additionally, you can quickly become constrained by the amount of host memory bandwidth when trying to feed data to multiple fast GPUs.

Unfortunately, the actual performance of your system is going to be hard to predict due to the variety of motherboard/CPU/memory factors that influence the net bandwidth. Your best bet is to write a small benchmark program and see how it runs.
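A minimal sketch of such a benchmark, using the CUDA runtime API for concreteness (an OpenCL version would enqueue writes on one command queue per device). The 256 MiB transfer size and the device IDs are arbitrary placeholders, not anything specific to your setup:

```cpp
// Sketch: time the same host->device copy issued to both GPUs at once.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u * 1024 * 1024;  // 256 MiB per device (arbitrary)

    // Pinned host memory is required for genuinely asynchronous DMA.
    void* hostBuf = nullptr;
    cudaMallocHost(&hostBuf, bytes);

    void* devBuf[2];
    cudaStream_t stream[2];
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&devBuf[dev], bytes);
        cudaStreamCreate(&stream[dev]);
    }

    // Enqueue both copies back to back so they overlap on the switch.
    auto t0 = std::chrono::steady_clock::now();
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpyAsync(devBuf[dev], hostBuf, bytes,
                        cudaMemcpyHostToDevice, stream[dev]);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
    }
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    std::printf("aggregate bandwidth: %.2f GB/s\n", 2.0 * bytes / s / 1e9);
    return 0;
}
```

Comparing the result against a run that enqueues the copy on only one device will show directly how much the simultaneous transfers cost on your particular system.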

Thanks, that’s what I suspected. As of now, the bottleneck is indeed on the host side (copying new data from one place in RAM into the PBO). Since kernels can run asynchronously, the result is going to be a tradeoff between kernel speed and transfer speed; in my case, probably between more pixels (heavy on the kernels) and framerate (heavy on the buses). I’ll try to write something that makes it easy to switch between strategies and benchmark them!

A setup that lets me upload half the data to each device would be optimal.
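Roughly what I mean, as an untested sketch built on the benchmark above (the function name is just for illustration): each device would receive its own half of the host buffer, so every byte crosses the shared host link only once.

```cpp
#include <cuda_runtime.h>

// Hypothetical variant of the copy loop above: upload a different half of
// the host buffer to each device instead of the full buffer to both.
void uploadHalves(void* hostBuf, void* devBuf[2],
                  cudaStream_t stream[2], size_t bytes) {
    const size_t half = bytes / 2;
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpyAsync(devBuf[dev],
                        static_cast<char*>(hostBuf) + dev * half,
                        half, cudaMemcpyHostToDevice, stream[dev]);
    }
}
```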

Adding to what seibert has already said:
If your host PCIe link is anything less than the PCIe 3.0 x16 connection between the two GPUs on the GTX 690, transferring the data from the GPU that has already received it to the other one will be significantly faster than transferring it from the host again.

Follow-up: Assuming for now that the data will be exactly the same for both GPUs, is there a quick way to transfer it between them without using PCIe?

EDIT: Exactly what tera just suggested :)

cudaMemcpyPeer()
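
A minimal sketch of using it (the buffer pointers are assumed to have been allocated on devices 0 and 1 beforehand):

```cpp
#include <cuda_runtime.h>

// Sketch: copy a buffer that already lives on device 0 directly to
// device 1, without staging it through host memory.
void copyToSecondGpu(void* devBuf0, void* devBuf1, size_t bytes) {
    // Enabling peer access is optional for cudaMemcpyPeer(), but it lets
    // the driver take the direct path when the topology supports it.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        cudaSetDevice(1);                  // device 1 will access device 0
        cudaDeviceEnablePeerAccess(0, 0);
    }
    cudaMemcpyPeer(devBuf1, 1,   // dst pointer, dst device
                   devBuf0, 0,   // src pointer, src device
                   bytes);
}
```

Strictly speaking this still travels over PCIe, but only through the switch on the card itself, so it never has to cross the host link or touch host memory.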