Question on CUDA 4.0's multi-GPU capability

Suppose I have multiple GPUs in a machine and I have a kernel running on GPU0.

With the UVA and P2P features of CUDA 4.0, can I modify the contents of an array on another device, say GPU1, while the kernel is running on GPU0?

The simpleP2P example in the CUDA 4.0 SDK does not demonstrate this.

It only demonstrates:

Peer-to-peer memcopies
A kernel running on GPU0 that reads input from a GPU1 buffer and writes output to a GPU0 buffer
A kernel running on GPU1 that reads input from a GPU0 buffer and writes output to a GPU1 buffer

Yes, you can.
But be really careful: while the writes (and reads) will work with kernels running on either or both devices, there's still no robust synchronization between the two running devices. The only reliable mechanism is letting the kernels finish and forcing a synchronization on the CPU side, at the coarse granularity of kernel launches.
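
In code, the basic pattern looks something like this (a minimal sketch of my own, with error checking dropped and names made up; this is not the SDK sample):

```cpp
// A kernel launched on GPU0 writes directly into a buffer that
// physically lives on GPU1, via UVA + peer access.
__global__ void writeRemote(float *remoteBuf, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        remoteBuf[i] = value;   // a plain store into the peer GPU's memory
}

int main()
{
    const int n = 1 << 20;
    float *bufOnGpu1 = 0;

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU0 reach GPU1?
    if (!canAccess) return 1;

    cudaSetDevice(1);
    cudaMalloc(&bufOnGpu1, n * sizeof(float));  // allocated on GPU1

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);           // GPU0 may now touch GPU1's memory

    // Launched on GPU0, but writing GPU1's buffer through the UVA pointer.
    writeRemote<<<(n + 255) / 256, 256>>>(bufOnGpu1, n, 42.0f);
    cudaDeviceSynchronize();
    return 0;
}
```

The key point is that once peer access is enabled, bufOnGpu1 is just a pointer like any other as far as the kernel is concerned.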

The danger is that you'll immediately slide down a slippery slope: you inevitably start coding your own busy-wait polling hacks to fake communication between the two devices, as opposed to a simple read or write of static data. Don't go there. It's perilous enough when you try to force kernel-wide synchronization between blocks on the same device (don't do that either! tmurray will slap you).

An example of where I use the multi-GPU write is with one of my particle simulation tools. Two GPUs have between them a partition of all the particles (say one GPU is responsible for all the particles with x<0 and GPU1 has all those with x>1, though actually there’s some overlap, but ignore that for now). Both GPUs do fancy force simulations using their own local copies of all the particles on their “side”. The GPUs know they’re only responsible for the x<0 or x>0 particles respectively. But when a particle is moved to a new location, if that new x is close to 0 (x-d respectively) that particle is COPIED to the other GPU’s memory, even as that GPU is still computing. That GPU doesn’t use the new particle immediately, it’s just getting recorded in a setup array for future use. I also don’t write the particles one by one, but in batches.
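
A sketch of what that kernel can look like (the Particle struct, the OVERLAP distance, and the staging scheme are illustrative stand-ins, and the real code stages writes in batches rather than one particle at a time):

```cpp
#define OVERLAP 0.1f

struct Particle { float x, y, z; };

// Note the append counter lives on the LOCAL GPU: atomics on peer memory
// over PCIe are not something I'd rely on; only the plain stores cross over.
__global__ void stepAndExport(Particle *mine, int n,
                              Particle *neighborStaging, // allocated on the other GPU
                              int *stagedCount)          // allocated on this GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    mine[i].x += 0.01f;                        // stand-in for the real force integration

    if (fabsf(mine[i].x) < OVERLAP) {          // ended up near the x=0 boundary?
        int slot = atomicAdd(stagedCount, 1);  // reserve a slot with a local atomic
        neighborStaging[slot] = mine[i];       // plain store straight into peer memory
    }
}
```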

Then a CPU sync waits for both GPUs to finish their computations, and the next time step starts on both GPUs. The nice part is that all the updated particles of interest are already on the GPU, ready to merge into the bigger particle list; I don't need a second CPU memory copy and synchronization, which would add more latency.
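
Roughly, the host side of a time step looks like this (the names and the mergeStaged kernel are hypothetical, just to show the shape of the loop; the buffers are allocated as in the kernel sketch above):

```cpp
for (int step = 0; step < numSteps; ++step) {
    cudaSetDevice(0); cudaMemset(count0, 0, sizeof(int));  // reset append counters
    cudaSetDevice(1); cudaMemset(count1, 0, sizeof(int));

    cudaSetDevice(0);
    stepAndExport<<<(n0 + 255) / 256, 256>>>(parts0, n0, staging1, count0);
    cudaSetDevice(1);
    stepAndExport<<<(n1 + 255) / 256, 256>>>(parts1, n1, staging0, count1);

    cudaSetDevice(0); cudaDeviceSynchronize();   // coarse CPU-side sync: the only
    cudaSetDevice(1); cudaDeviceSynchronize();   // robust cross-device barrier

    int staged01 = 0, staged10 = 0;              // how many each side exported
    cudaMemcpy(&staged01, count0, sizeof(int), cudaMemcpyDefault);
    cudaMemcpy(&staged10, count1, sizeof(int), cudaMemcpyDefault);

    // The staged particles are already resident on each GPU; merge them into
    // the main list with a local kernel, no extra host<->device particle copies.
    cudaSetDevice(0);
    mergeStaged<<<1, 256>>>(parts0, staging0, staged10);
    cudaSetDevice(1);
    mergeStaged<<<1, 256>>>(parts1, staging1, staged01);
}
```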

Now in practice, it's even more complex than this, because I do the same trick for multiple kernels on the same device at the same time, and I support N GPUs, but the "write boundary updates to your neighbors" trick is the same. Running multiple kernels on one device keeps the GPU busy even with the CPU sync: while kernel 1 is waiting, kernel 2 is running, so the GPU isn't idle. The nice part about UVA is that a kernel doesn't know or care whether it's writing to its own GPU or to its neighbor, so the multi-GPU aspect is abstracted away at the memory level.
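
The multiple-kernels-on-one-device part is just ordinary stream overlap, something like this (assumed structure, not my actual tool, and it needs a Fermi-class GPU that can run kernels concurrently):

```cpp
cudaStream_t sA, sB;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);

// Two independent partitions in flight on the same device.
stepAndExport<<<(nA + 255) / 256, 256, 0, sA>>>(partsA, nA, stagingB, countA);
stepAndExport<<<(nB + 255) / 256, 256, 0, sB>>>(partsB, nB, stagingA, countB);

cudaStreamSynchronize(sA);  // host blocks here, but stream B keeps running
// ... merge / relaunch work for partition A ...
cudaStreamSynchronize(sB);

cudaStreamDestroy(sA);
cudaStreamDestroy(sB);
```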