Not directly. You have to copy the data from device #1 to the host, then from the host to device #2. Moreover, in CUDA 2.1 there is no way to share a page-locked block of host memory between two host threads, and two host threads are exactly what you need to drive two GPUs at the same time. That meant either the device1->host or the host->device2 copy ran at slow, unpinned speed. CUDA 2.2 fixes this with "portable" pinned memory, which multiple host threads can share. (See next question also.)
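A minimal sketch of the two-hop copy, assuming CUDA 2.2's portable pinned memory (`cudaHostAllocPortable`). The function name `copyBetweenGpus` and buffer names are made up for illustration; note that in CUDA 2.x each `cudaMemcpy` must actually be issued from the host thread that owns the corresponding device's context, so in a real program the two copies below live in two different threads:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: stage a buffer from device 0 to device 1 through
// a shared page-locked host buffer. Error checking omitted for brevity.
void copyBetweenGpus(void* dst_dev1, const void* src_dev0, size_t bytes)
{
    void* staging;
    // cudaHostAllocPortable (new in CUDA 2.2) makes the pinned buffer
    // visible to all host threads / CUDA contexts, so both copies below
    // get full pinned-memory bandwidth.
    cudaHostAlloc(&staging, bytes, cudaHostAllocPortable);

    // Issued by the host thread bound to device 0: device -> host
    cudaMemcpy(staging, src_dev0, bytes, cudaMemcpyDeviceToHost);

    // Issued by the host thread bound to device 1: host -> device
    cudaMemcpy(dst_dev1, staging, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(staging);
}
```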
Yes, this is pretty much the only way to do it. A given host thread can only be associated with one CUDA device at a time. There is a handy C++ class called GPUWorker which handles the host threads for you.
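If you want to roll it by hand instead of using GPUWorker, the pattern is one host thread per GPU, each calling `cudaSetDevice` before touching the CUDA API. A rough sketch with pthreads (error handling omitted; `cudaThreadSynchronize` is the CUDA 2.x-era sync call):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

// Each worker binds itself to one GPU and then does all of its CUDA
// work there; the context stays attached to this thread.
void* worker(void* arg)
{
    int dev = *(int*)arg;
    cudaSetDevice(dev);       // must precede any other CUDA call
    // ... allocate device memory, launch kernels on this device ...
    cudaThreadSynchronize();  // wait for this device's work to finish
    return 0;
}

int main()
{
    int ids[2] = {0, 1};
    pthread_t threads[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], 0, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], 0);
    return 0;
}
```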
Comments from NVIDIA employees in the past suggest the SLI link is actually not that fast. PCI Express, however, is designed to allow devices to communicate directly with each other, so in principle a GPU-to-GPU copy could be done over that link at 3 or 6 GB/sec. This capability is not present in CUDA yet, though people have asked for it.