IDEA: Intrinsic multi-GPU support (Even over a network)

The current model behind CUDA is very convenient. The thread has one GPU context. It is automatically created when needed and destroyed when the thread ends.

But, the current model for using multiple GPUs is not convenient. There need to be multiple threads, each with its own context. A completely different, possibly platform-dependent programming API is required to synchronize between the threads. Efficient memcopies between GPUs are not possible.

With multiple GPUs being the high-performance standard not only in the future (with Moore’s Law’s end), but already in the present, better multi-GPU support is necessary. I don’t know how the context model will be affected or the under-the-hood changes required, but a single CPU thread should be able to control multiple GPUs.

I think the rest of the CUDA framework would fit nicely. Asynchronous execution would work as is. For example, simultaneously launching two kernels on different cards would not require new syntax. It would happen, in C, just by calling one kernel after the other. Existing memcpyAsync and CUDA streams functionality would all carry over as well and further reduce synchronizations with the CPU. Synchronization with the CPU needs to occur only for a synchronous host-device cudaMemcpy. Likewise, synchronization among GPUs will only happen when there is a card-card cudaMemcpy.*
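A sketch of how this might look. This is hypothetical: current CUDA does not allow one thread to drive multiple devices this way, and the single-thread multi-device semantics of `cudaSetDevice` assumed below (merely selecting the target for subsequent calls, rather than binding the thread to one context) are part of the proposal, not the existing API.

```cuda
// Hypothetical: one CPU thread drives two GPUs.
// Kernel launches are asynchronous, so the two kernels below
// would run concurrently, one on each card.
cudaSetDevice(0);
kernelA<<<gridA, blockA>>>(d_dataA);   // returns immediately

cudaSetDevice(1);
kernelB<<<gridB, blockB>>>(d_dataB);   // runs in parallel with kernelA

// The only forced CPU synchronization: a synchronous host-device copy.
cudaSetDevice(0);
cudaMemcpy(h_result, d_dataA, bytes, cudaMemcpyDeviceToHost);
```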

Additionally, this model could be scaled across a network. As with multiple CPU threads, using an external API like MPI may be the best choice for many applications. But the capability to talk to GPUs over a network with no additional complexity would further CUDA’s accessibility. Moreover, the efficiency may be pretty good in many cases, since the above asynchronous model should naturally encourage a decent decomposition of the algorithm. The bandwidth over an interconnect like Infiniband can even achieve similar performance to PCIe x16.

*Hopefully, CUDA will switch to 64bit addresses that allow a flat memory model. You would not need to say “memcpy from device 1 to device 2.” You would not even need to say “memcpy from device to host.” The pointer would indicate its location on its own. This would reduce verbosity and bugs. In the future I hope shared memory and constant memory also join the single address space, simplifying the PTX ISA and getting rid of those annoying “unknown pointer” compiler errors.
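Under such a flat 64-bit address space, the copy calls could drop the direction flag entirely, since the pointer itself would say where the data lives. A sketch (hypothetical: the flag-free `cudaMemcpy` overload below does not exist):

```cuda
// Hypothetical flat address space: each allocation gets a unique
// 64-bit address, so no cudaMemcpyKind argument is needed.
float *p0, *p1, *h;
cudaSetDevice(0); cudaMalloc(&p0, bytes);  // address in device 0's range
cudaSetDevice(1); cudaMalloc(&p1, bytes);  // address in device 1's range
h = (float *)malloc(bytes);                // address in host range

cudaMemcpy(p1, p0, bytes);  // card-to-card: direction inferred from pointers
cudaMemcpy(h,  p1, bytes);  // device-to-host: likewise inferred
```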

The way streams would work is this: Within a stream, kernels assigned to different GPUs will run in parallel, and the normal sync-on-cudaMemcpy rules would apply. Between streams there is no synchronization (except for operations on stream 0). Operations on different streams can automatically overlap. Execution can run in parallel with device-host copies and network transfers, hardware permitting. NVIDIA may also have the leeway to modify its driver and allow RDMA transfers using supported adapters (e.g. ConnectX), even if it does not get around to writing a general-purpose API for doing PCI DMA to its GPUs.
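Those stream rules might look like this in code (hypothetical: multi-device streams of this kind do not exist; the kernel names are placeholders):

```cuda
// Hypothetical multi-GPU stream: kernels bound to different GPUs
// run in parallel within the stream; a card-card cudaMemcpy is the
// only inter-GPU synchronization point.
cudaStream_t s;
cudaStreamCreate(&s);

cudaSetDevice(0); produce<<<grid, block, 0, s>>>(d_src_on_0);
cudaSetDevice(1); prepare<<<grid, block, 0, s>>>(d_tmp_on_1);  // overlaps

// Waits (within this stream) for both GPUs before copying.
cudaMemcpyAsync(d_dst_on_1, d_src_on_0, bytes,
                cudaMemcpyDeviceToDevice, s);

cudaSetDevice(1); consume<<<grid, block, 0, s>>>(d_dst_on_1);
```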

“Moreover, the efficiency may be pretty good in many cases, since the above asynchronous model should naturally encourage a decent decomposition of the algorithm. The bandwidth over an interconnect like Infiniband can even achieve similar performance to PCIe x16.”

Latency should also be considered a serious issue: many algorithms do need synchronization anyhow, and scaling transparently on MPI is not that trivial there :/
Actually, I don’t really see why it should be CUDA’s job to handle such multi-node issues; it should be left to the programmer by offering means to directly copy data from the GPU to some generic PCI device (typically a network interface card). I guess we are still not there yet!

Otherwise, I’m also 100% convinced that allowing only a single context per thread at a time is a serious limitation. Why should it not be possible to copy data directly from one context to another without a terribly costly memory copy on the host? On the other hand, why should a CPU thread that does not hold the context not be able to do a memory copy either?
CUDA is not only poorly adapted to multi-GPU setups, as alex_dubinsky said; it’s also not adapted to multicore/SMP architectures.

While I agree a flat memory model could be useful, would not a first step be to just extend the existing API with some “*_with_context” functions, allow multiple threads to hold the same context, and make it possible to have multiple contexts loaded at the same time? If you still have a “default” context, this remains compatible with the existing API. There remains the issue of concurrency among different threads holding the same context… either you assume people are well behaved and possibly offer an explicit lock mechanism, or the runtime does the proper locking itself.
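A sketch of what such an extension could look like (every function name below is hypothetical; only the `CUcontext` type and the naming convention come from the existing driver API):

```cuda
// Hypothetical "*_with_context" extension: any thread may operate on
// any context, passing the context explicitly instead of holding it.
CUcontext ctx;
cuCtxCreate(&ctx, 0, device);                 // existing driver-API call

// From any CPU thread, no exclusive ownership required:
cuMemAllocWithContext(ctx, &dptr, bytes);               // hypothetical
cuMemcpyHtoDWithContext(ctx, dptr, h_data, bytes);      // hypothetical

// Optional explicit lock for threads sharing one context:
cuCtxLock(ctx);                                          // hypothetical
cuLaunchGridWithContext(ctx, func, grid_w, grid_h);      // hypothetical
cuCtxUnlock(ctx);                                        // hypothetical
```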

Those modifications are certainly quite big in terms of runtime-system code, but the impact on the model itself is not so large. Current machines are neither 1 CPU, 1 GPU, nor 1 CPU, k GPUs, but rather (clusters of) n CPUs, k GPUs (with n >> k), so these issues will have to be considered sooner or later… CUDA is nice and getting nicer every day, but it’s not adapted to multicore processors at all for the moment.

Thanks for your time,

I’ve often wished we could take this idea one step further, and enable the grid from a single kernel to be split over multiple cards. The two layer design of CUDA really lends itself to the NUMA situation presented by multiple GPUs in a single system. Intra-block communication and synchronization is fast and strongly encouraged, whereas inter-block communication/synchronization is discouraged. (I’m going to put aside global memory atomics, which complicate matters.) Moreover, blocks only need to see a consistent view of global memory at the start of a kernel, not at all times during execution, which vastly simplifies the hardware requirement. If multiple writes are made to global memory, there is no ordering requirement, nor is it even required that these writes be visible to other threads while the kernel is executing.

It would be great if we could create a multi-GPU context that would have the following properties:

  • Global atomics are forbidden (no way around that, as far as I can see).
  • The context can have any subset of CUDA devices present in the system associated with it, which I’ll call the active devices. Initially, one might want to require that all active devices in a context are identical to make block distribution easier (see below), but I could see loosening that eventually.
  • cudaMalloc() allocates identical memory regions on all active devices. The returned pointer is either identical for all devices, or some kind of hidden translation is used to associate the returned pointer with the memory location on each device.
  • Host-to-device and device-to-device memory copies are automatically replicated to all active devices so that each device’s memory is kept consistent with the others.
  • Similar behavior for CUDA functions which manage textures and constant memory.
  • When a kernel is called, the blocks in the grid will be distributed over the active devices in some fashion. If the devices are identical, this is trivial. If not, things are more complicated.
  • Before another kernel call or device-to-host cudaMemcpy can be processed, the global memory on the active devices must be synchronized. I’m almost certain this is going to require some hardware magic: dirty bits on global memory locations and direct GPU-to-GPU communication over the PCI-Express bus. (Or maybe over those SLI connectors? Anyone know what information those even carry?) In the situation where a single global memory location has been modified on multiple devices, current CUDA rules apply: one of the writes will arbitrarily succeed.
  • Any of the existing hacks to enable inter-block synchronization should be avoided as they are guaranteed not to work.
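Taken together, the properties above would let unmodified single-GPU code scale out. A usage sketch (hypothetical: `cudaCreateMultiContext` and the replication/distribution semantics are exactly the proposal, not an existing API):

```cuda
// Hypothetical multi-GPU context: one cudaMalloc/cudaMemcpy is
// replicated across all active devices, and one launch splits the
// grid among them.
int devices[2] = {0, 1};
cudaCreateMultiContext(devices, 2);          // hypothetical

float *d_data;
cudaMalloc(&d_data, bytes);                  // allocated on both GPUs
cudaMemcpy(d_data, h_data, bytes,
           cudaMemcpyHostToDevice);          // replicated to both GPUs

// Blocks are distributed over the (identical) active devices; global
// memories are re-synchronized before the copy back is processed.
kernel<<<grid, block>>>(d_data);
cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);
```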

So far all of the CUDA kernels I’ve had to write would “just work” in this kind of system with no code changes. The scaling would even be very good, assuming the (admittedly magic) inter-GPU synchronization step had a time complexity proportional to the number of bytes written to global memory, rather than the total number of bytes of global memory that had been allocated.

Admittedly, there are some algorithms that would perform very badly, but I think this would be a great solution to take the multi-GPU scaling burden off many CUDA programs.

The latency of Infiniband is 1us, which is much smaller than just the overhead of a CUDA kernel call or memcpy (10us). So I think that should fit in well into the CUDA model.

In other words, CUDA latency is still a shameful 10us, compared to 1us latency for IB. Hopefully CUDA latency will improve a lot; I’m not sure we can expect huge improvements in high-speed network latency anymore…

That’s an important issue, n CPUs - 1 GPU. But the thing is that having multiple threads submit commands to the same stream is very tricky, any way you look at it. I think you might as well just channel them through one CPU thread. Of course, if you want each thread to have its own independent stream, that fits within the current context model. What may be very useful, however, is the ability to map the same piece of device memory in multiple contexts. Another nice capability may be to let threads wait on a CUDA event without holding the context. (The latter complements the former, as well as making all sorts of inter-thread synchronization easier.) That would really complete the model.
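Those two capabilities could be sketched like this (hypothetical: `cuMemMapIntoContext` and `cuEventSynchronizeDetached` are invented names for the proposed calls; `cuMemAlloc` and `cuEventRecord` are the existing driver-API functions):

```cuda
// Hypothetical: map one allocation into a second context, and let a
// thread wait on a CUDA event without holding any context at all.
cuMemAlloc(&dptr, bytes);               // allocated in context A (owner)
cuMemMapIntoContext(ctxB, dptr);        // hypothetical: now visible in B

cuEventRecord(done, stream);            // recorded by the owning thread

// In another thread, which holds no context:
cuEventSynchronizeDetached(done);       // hypothetical: context-free wait
```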

That is tricky, for sure, but who ever said parallel programming is trivial? :) Getting a single core to handle all requests is certainly practical for NVIDIA, but sometimes you just don’t want to first transmit a request to a core, have it perform some CUDA operation, and then have that core produce an event back to the first core…

It is interesting to note that the very same problem happens on multirail networks (several network interface cards per node): you can have a single core handle all communication for a given card, but a significant number of people will appreciate being able to make MPI calls from anywhere in the code, and not necessarily from “the communication thread”. But while getting access to the different pieces of data from anywhere on the machine is useful, you are certainly right that we may only have a single thread submitting commands/tasks at a time (possibly letting the user take care of the locking issues…)

I totally agree with you that the important things are 1/ getting access to the device’s memory from different places, and not only from the thread holding the context, and 2/ being able to synchronize any CPU thread from any CPU thread. That’s really the kind of issue I’ve run into so far…