Can a large buffer be "split" across multiple GPUs?

[Sorry if this is an easy question, but I haven’t found a discussion of it either in the CUDA SDK documentation or in this forum.]

Imagine you have a randomly-accessed buffer of size 80GB that must reside in CUDA global memory, and a machine with four 32GB V100s (and NVLink). Is it possible to split that buffer into (say) four 20GB partitions, each of which resides on one of the four GPUs, and then address the entire buffer from all four of the GPUs with a unified 80GB range of addresses?

I can imagine doing this by implementing my own buffer allocation and address translation. The question is whether there is a CUDA API incantation that would accomplish the same thing.

Thank you!

Maybe. Possibly 3 methods (sketched in code after the list):

  1. If in a CUDA 9.x or 10.x regime on Linux, you can use managed memory oversubscription to handle this case. Just do cudaMallocManaged for an 80GB buffer, and run with it. Pass that pointer to all GPUs that need to access it. (so this probably doesn’t fit your description, exactly, if so disregard).

  2. If your GPUs are all in a peerable relationship with each other, you could split the buffer, say, into four 20GB pieces. Do a cudaMalloc allocation on each GPU for the 20GB chunk. Copy a 20GB chunk to each GPU. Put all 4 GPUs into a peer clique using cudaDeviceEnablePeerAccess. Pass all 4 pointers to each GPU. Profit. (so this probably doesn’t fit your description, exactly, if so disregard).

  3. Use host-pinned memory. The single pointer returned from cudaHostAlloc is accessible on all 4 GPUs. (so this probably doesn’t fit your description, exactly, if so disregard).
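
All three are sketched below, purely as an untested illustration (error checking omitted, 4 GPUs assumed, and obviously the three 80GB allocations would not all coexist on a real machine):

```cpp
#include <cuda_runtime.h>

int main() {
    const int nGpus = 4;
    const size_t total = 80ull << 30;      // 80GB
    const size_t chunk = total / nGpus;    // 20GB per GPU

    // (1) Managed memory: one pointer, usable from every GPU.
    char *managed = nullptr;
    cudaMallocManaged(&managed, total);

    // (2) Peer clique: one 20GB cudaMalloc per GPU, peer access enabled in
    //     both directions, and all 4 pointers passed to every kernel.
    char *piece[4];
    for (int dev = 0; dev < nGpus; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&piece[dev], chunk);
        for (int peer = 0; peer < nGpus; ++peer)
            if (peer != dev) cudaDeviceEnablePeerAccess(peer, 0);
    }

    // (3) Host-pinned memory: one pointer, accessible from all GPUs (over PCIe).
    char *pinned = nullptr;
    cudaHostAlloc(&pinned, total, cudaHostAllocPortable | cudaHostAllocMapped);

    // ... copy data in, launch kernels, etc. ...

    cudaFreeHost(pinned);
    for (int dev = 0; dev < nGpus; ++dev) cudaFree(piece[dev]);
    cudaFree(managed);
    return 0;
}
```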

These 3 methods are obviously different. There may be other approaches also. Probably none of these options perfectly fit your description. If none of these are useful, then my guess is the answer to your question is “no, not possible”.

Thank you for your prompt reply and for your ideas!

To be more specific: the 80GB range of addresses mapped into the address space of all four of the GPUs ought to be the same. (Imagine that the buffer contains a lookup table, a hash table, or some other structure whose contents are indexed by computing an address from some given data.) And the range of addresses ought to be contiguous, so that the correct address can be computed without knowing which chunk of the buffer resides on which GPU.
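
For illustration only (the names and the hash are made up, not our actual code), this is the kind of device-side access I mean:

```cpp
// The index is computed from the data itself, so the whole 80GB table must
// appear as one contiguous array from every GPU. The hash is a placeholder.
__device__ __forceinline__ unsigned long long mixKey(unsigned long long k) {
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
    return k;
}

__device__ float lookupValue(const float *table, size_t tableElems,
                             unsigned long long key) {
    size_t idx = (size_t)(mixKey(key) % tableElems);  // anywhere in the 80GB
    return table[idx];
}
```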

With this in mind:

#1 might work if a) we can force the memory manager to keep everything in GPU memory and b) we can distribute the data evenly across all GPUs. We need to reserve some memory on each GPU to compute with!

#2 sounds good, but I don’t know how to guarantee that the addresses of the four buffers would be mapped contiguously. Do you know if the driver works that way if it has to map (say) four consecutive memory allocations into GPU address space? (Just experimenting with it won’t help. We’d want a guarantee from the driver that we’d get contiguously-mapped address ranges on any hardware/driver configuration.)

#3 actually works (which we know from experience) but it’s slow because all memory accesses traverse the PCIe bus. The goal is to get the data into CUDA global memory and exploit NVLink to try for better speed.

It sounds like we’re “close”, i.e. method #1 ensures contiguous addressing and method #2 ensures that the allocations reside in device memory. Do you know enough about the innards of the memory manager to decide which way to go?

Thanks again…

#2 won’t work. There is no way to guarantee that the addresses returned are contiguous, and no way to force that. There is no way you can take 4 pointers and make them behave as one. For example, self-referential indices or pointers in the data would immediately break unless you invested a lot of additional coding effort.

For item 1, you can use hints, in particular memory range-based hints using cudaMemAdvise with cudaMemAdviseSetPreferredLocation:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge37112fc1ac88d0f6bab7a945e48760a

to create four 20GB ranges out of your 80GB buffer, and advise the preferred locations for each of those chunks, one to each GPU. You can read the documentation to get an idea of the implications and corner cases. You might also want to do cudaMemPrefetchAsync on each section, to “push” it to each GPU.

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf
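
An untested sketch of that idea (4 GPUs assumed, error checking omitted):

```cpp
#include <cuda_runtime.h>

int main() {
    const int nGpus = 4;
    const size_t total = 80ull << 30;      // 80GB managed buffer
    const size_t chunk = total / nGpus;    // 20GB range per GPU

    char *buf = nullptr;
    cudaMallocManaged(&buf, total);

    for (int dev = 0; dev < nGpus; ++dev) {
        char *range = buf + (size_t)dev * chunk;
        // hint: prefer to keep this 20GB range resident on GPU `dev`
        cudaMemAdvise(range, chunk, cudaMemAdviseSetPreferredLocation, dev);
        // "push" the range to that GPU now rather than on first touch
        cudaSetDevice(dev);
        cudaMemPrefetchAsync(range, chunk, dev, 0);
    }
    for (int dev = 0; dev < nGpus; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    // every GPU sees the same contiguous 80GB pointer `buf`
    // ... launch kernels on any/all GPUs here ...

    cudaFree(buf);
    return 0;
}
```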

For case 1 and the follow-on comments I have just made, I think there is an assumption here that your partitioning of the data into four 20GB chunks, with the intent to locate one chunk on each GPU, has some basis in code behavior. For example: “I have partitioned my code (the kernels I launch) so that kernels launched on GPU A mostly use the 20GB of data assigned to GPU A and only occasionally access the data on GPUs B, C, and D.” If instead your access patterns are totally “random”, then the memory advising probably doesn’t make sense, and this essentially becomes a performance-benchmarking exercise. In the truly random case there is no strategy, because most strategies depend on some (non-trivial) knowledge of your data access patterns. If your access patterns are sufficiently random, you might not do better than case 3, or case 1 with just a straightforward cudaMallocManaged allocation and no further coding effort. One of those two could be your performance baseline against which you judge any other strategy.

Thanks again for helping me think this through and for the links to relevant documentation.

I don’t want to “overthink” this. Again, we essentially have a read-only, hash-table-like scenario where access to a very large lookup table is expected to be random, i.e., a thread on any of the peer GPUs could issue a read from anywhere in the table. Consequently, it seems to me that it would make sense to distribute the data evenly across all of the GPU devices, and that driver-level heuristics (e.g., inter-GPU data migration) would be ineffective at best.

I think the first thing to try is simply to enable peer access, partition the data across all of the GPU devices, and convert buffer offsets to per-GPU device pointers explicitly in code. (My guess is that the cost of computing a device pointer from a buffer offset would be negligible compared to global-memory latency anyway.) We’ll see how that goes before attempting anything more subtle with cudaMallocManaged.
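
For concreteness, the translation I have in mind is roughly this (untested; the names are just illustrative):

```cpp
// Sketch: the table is split into 4 equal chunks, one cudaMalloc'ed on each
// GPU with peer access enabled, and every kernel receives all 4 base
// pointers. A global element index is translated to (device, offset).
struct TableView {
    const float *base[4];    // one device pointer per GPU chunk
    size_t elemsPerChunk;    // number of elements in each 20GB chunk
};

__device__ __forceinline__ float tableRead(const TableView &t, size_t idx) {
    size_t which = idx / t.elemsPerChunk;   // which GPU holds this element
    size_t off   = idx % t.elemsPerChunk;   // offset within that chunk
    return t.base[which][off];              // remote reads traverse NVLink
}
```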

That will likely be best if you’re willing to break your single pointer/contiguous access into 4 discontiguous ranges with 4 pointers (choice 2). It seemed like you didn’t want to do that, which is why I mentioned the others.

A brief follow-up: compared to maintaining a single unified table in CUDA “pinned” memory (page-locked system RAM), we see a 10-20x speedup in kernels that randomly access the same data partitioned across multiple GPUs. The implementation uses NVLink, “peer access” everywhere, and explicit address translation in CUDA kernel code.

Too bad this kind of thing isn’t supported through the memory-virtualization functionality in CUDA. Perhaps it’s not a common-enough use case.

Thanks again for helping me think this through.