Multiple devices in a single machine: communication

If several CUDA-capable devices are installed in a single PC, does CUDA allow:

  1. Direct communication between two devices over PCIe (to avoid the slow dev1 -> host -> dev2 path)?

  2. If not, will such support be available in future CUDA versions? From the point of view of a single block, its threads cannot communicate with threads of other blocks, but all threads of all blocks can read and write global and texture memory. So blocks on another device differ from local blocks only in that they cannot access this device's global and texture memory.
    Providing functions that use PCIe (or maybe SLI) transfers for inter-device communication could solve this problem and bring a lot of benefits to multi-device platforms, such as:

  3. A unified memory area whose capacity is the sum of the memories of all devices

  4. Each device accesses this memory in the same way, and the whole area is available to every device

  5. Blocks could be distributed across all devices (without modifying the software)

  6. No need to recode the application when the number of CUDA devices changes

  7. Multiple devices are visible to the host as one large device with one large memory

  8. Access from the host to multiple devices would look like access to a single device (the driver should handle this, spawning more host threads if necessary to cover the communication from host to devices, but it should be transparent to the user. That means the user sees one memory area rather than blocks of memory on different devices, and "sees" one large device rather than a bunch of them as in CUDA 2.0 now)

  9. Writing applications would then be easier (and faster)
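For reference on point 2: blocks on one device can already exchange data through global memory, but only between kernel launches, since blocks cannot synchronize with each other within a launch. A minimal single-device sketch (a toy block-wise reduction; kernel and variable names are mine):

```cuda
// Each block reduces its slice in shared memory, then publishes one
// partial result to global memory, where every block of a *subsequent*
// kernel launch can read it.
__global__ void partialSums(const int *in, int *partial, int n)
{
    __shared__ int s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = s[0];  // visible to the next kernel launch
}
```

The point of the sketch: global memory is the existing inter-block channel on one device, which is why extending the same model across devices (points 3-8) looks natural from the programmer's side.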

It seems the current situation is far from that. Writing an application that fully utilizes, for example, 3 or 4 CUDA-capable GPUs is a real nightmare.
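To illustrate the "nightmare": under CUDA 2.x a host thread owns at most one device context, so even a simple data split needs one host thread per GPU. A rough sketch with pthreads (struct and function names are mine, the kernel launch is elided):

```cuda
#include <pthread.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct WorkerArg { int dev; const float *hostIn; float *hostOut; size_t n; };

// One of these runs per GPU, each in its own host thread.
static void *worker(void *p)
{
    struct WorkerArg *a = (struct WorkerArg *)p;
    float *dIn, *dOut;
    cudaSetDevice(a->dev);  // bind this thread's context to its GPU
    cudaMalloc((void **)&dIn,  a->n * sizeof(float));
    cudaMalloc((void **)&dOut, a->n * sizeof(float));
    cudaMemcpy(dIn, a->hostIn, a->n * sizeof(float), cudaMemcpyHostToDevice);
    /* ... launch the kernel on this device's slice of the data ... */
    cudaMemcpy(a->hostOut, dOut, a->n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return NULL;
}

// The host must split the data, spawn one thread per device, and join them:
//   pthread_t t[4]; struct WorkerArg a[4];
//   for (int d = 0; d < devCount; ++d)
//       pthread_create(&t[d], NULL, worker, &a[d]);
//   for (int d = 0; d < devCount; ++d)
//       pthread_join(t[d], NULL);
```

All of the partitioning, thread management, and result merging is the application's problem, which is exactly what the unified-device proposal above would push into the driver.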

I am interested in what others think about such a concept.

CUDA kernels cannot initiate DMA. The host always has to call cudaMemcpy to transfer data to and from the GPU. This also means that a CUDA kernel cannot send any message outside its own GPU.
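Concretely, moving a buffer from one GPU to another therefore has to be staged through host memory. A sketch (function name is mine; in practice under CUDA 2.x the two devices would live in separate host threads, so this is simplified):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Copy n bytes from a buffer on device 0 to a buffer on device 1,
// bouncing through a host staging buffer -- the only option without
// direct device-to-device transfers.
void copyViaHost(const void *d0_src, void *d1_dst, size_t n)
{
    void *h = malloc(n);  // pinned memory (cudaMallocHost) would be faster
    cudaMemcpy(h, d0_src, n, cudaMemcpyDeviceToHost);  // dev0 -> host
    cudaMemcpy(d1_dst, h, n, cudaMemcpyHostToDevice);  // host -> dev1
    free(h);
}
```

Every inter-device "message" pays for two PCIe crossings plus a host round trip, which is the slow path the original question wants to avoid.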

okay! Unification would involve tremendous hardware changes and library changes, and it would clog PCIe bandwidth too – especially if a kernel is distributed among so many other GPUs. I doubt anyone would ever do that.