The two CUDA devices share the PCI-Express bandwidth through a switch. If only one of the two devices on the card is transferring, that device can use the full bandwidth, just as on a single-device card. If both devices transfer simultaneously, each device gets only half the bandwidth.
Each CUDA device in a GTX 295 has 240 stream processors.
There is still no mechanism in CUDA to copy data from one GPU directly to another one. You have to copy the data from GPU #1 to the host with host thread #1, then copy it from the host to GPU #2 with host thread #2. To make this as fast as possible, you should declare the pinned host memory to be portable between threads, so that both threads see the same buffer as pinned. This requires calling cudaHostAlloc() with the cudaHostAllocPortable flag. (See CUDA 2.2 release notes.)
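A rough sketch of that staging pattern, in case it helps. This is my own illustration, not code from the release notes: the buffer size, the device numbers 0 and 1, the CHECK macro, and the use of std::thread (which needs a toolkit recent enough to compile C++11 host code) are all assumptions. The two host threads run one after the other here just to keep the example simple.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         cudaGetErrorString(err_));                    \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

int main() {
    const size_t n = 1 << 20;               // assumed element count
    const size_t bytes = n * sizeof(float);

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("This sketch needs two CUDA devices.\n");
        return 0;
    }

    // Pinned host buffer, marked portable so that every host thread
    // (and therefore both devices) treats it as pinned memory.
    float *staging = nullptr;
    CHECK(cudaHostAlloc((void **)&staging, bytes, cudaHostAllocPortable));

    std::vector<float> input(n, 1.0f), output(n, 0.0f);

    // Host thread #1 drives GPU #1 (device 0): device -> pinned host buffer.
    std::thread t1([&] {
        CHECK(cudaSetDevice(0));
        float *d_src = nullptr;
        CHECK(cudaMalloc((void **)&d_src, bytes));
        CHECK(cudaMemcpy(d_src, input.data(), bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(staging, d_src, bytes, cudaMemcpyDeviceToHost));
        CHECK(cudaFree(d_src));
    });
    t1.join();  // the staging buffer is complete before thread #2 reads it

    // Host thread #2 drives GPU #2 (device 1): pinned host buffer -> device.
    std::thread t2([&] {
        CHECK(cudaSetDevice(1));
        float *d_dst = nullptr;
        CHECK(cudaMalloc((void **)&d_dst, bytes));
        CHECK(cudaMemcpy(d_dst, staging, bytes, cudaMemcpyHostToDevice));
        // Read the data back only to check that it arrived intact.
        CHECK(cudaMemcpy(output.data(), d_dst, bytes, cudaMemcpyDeviceToHost));
        CHECK(cudaFree(d_dst));
    });
    t2.join();

    CHECK(cudaFreeHost(staging));
    std::printf("first element after round trip: %f\n", output[0]);
    return 0;
}
```

Without the cudaHostAllocPortable flag, only the thread that allocated the buffer gets the pinned-memory transfer speed; the other thread's copy would be treated as a copy from ordinary pageable memory.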
There is no synchronization between the two devices. They are treated like two completely independent cards that share a PCI-Express slot.
None that I’m aware of.
No. As long as you match the compute capability (1.3 for GTX 200 cards), clock rate, number of stream processors, and memory bandwidth, there should be no performance difference.
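If you want to compare two cards on those points, something like the following prints the relevant numbers for every device in the system. Again, just a sketch: it assumes a toolkit that provides cudaDeviceGetAttribute(), which is newer than the one discussed above, and it prints raw values rather than a computed bandwidth figure. Note that the runtime reports multiprocessors, not stream processors; on compute capability 1.x parts each multiprocessor has 8 stream processors, so a 240-processor device shows up as 30 multiprocessors.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA devices found.\n");
        return 0;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        int major = 0, minor = 0, sms = 0;
        int clockKHz = 0, memClockKHz = 0, busBits = 0;
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, dev);
        cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
        cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, dev);
        cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, dev);
        cudaDeviceGetAttribute(&busBits, cudaDevAttrGlobalMemoryBusWidth, dev);

        std::printf("Device %d: %s\n", dev, prop.name);
        std::printf("  compute capability : %d.%d\n", major, minor);
        std::printf("  multiprocessors    : %d\n", sms);
        std::printf("  core clock         : %d kHz\n", clockKHz);
        std::printf("  memory clock       : %d kHz\n", memClockKHz);
        std::printf("  memory bus width   : %d bits\n", busBits);
    }
    return 0;
}
```

On a GTX 295 the two halves of the card should show up as two separate entries with identical values.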