Does CUDA Unified Memory let multiple GPUs concurrently write to random, non-overlapping locations of a host array?

tugrul_192bit · March 29, 2018, 9:21am

For example, two GPUs (Kepler, or cc 3.0+) launch kernels concurrently, and:

GPU-1 writes to a[0], a[155], a[1000] (will not collide with 1,200,3,5 ever)

GPU-2 writes to a[1], a[200], a[3], a[5] (will not collide with 0,155,1000 ever)

After the streams are synchronized on the host (no memcpy, just Unified Memory), can we trust the data specifically on the “CPU” side, where reads will touch indices between 0 and 1000?

I don’t care whether one GPU sees another GPU’s writes. I’m asking only about what the CPU will see.

If there is no problem, what kind of performance degradation can be expected? For example, fully randomized writes (but again, no collisions on 8-byte-wide regions) to a 50 MB array, using 3 GPUs concurrently.

Robert_Crovella · March 29, 2018, 11:11am

This may work under certain specific conditions (e.g. CUDA 9.1, Linux, P2P-capable GPUs). In other scenarios, it may not. For general or robust usage, it’s probably not a good design pattern.
See comments below.

BulatZiganshin · March 30, 2018, 8:33am

I wonder how it works? a[0] and a[1] are in the same cache line. Does the GPU perform word-granular writes over the PCIe bus, or does it somehow lock the entire cache line for the duration of a read-modify-write operation?

tugrul_192bit · March 30, 2018, 8:50am

I expect a slowdown, of course. Maybe two lucky GPUs send data at the same time, so that both writes get interleaved in the same PCIe stream and land in the same cache line, with maybe 0.001% probability? Can data from both GPUs be interleaved (GPU-1 byte 1, GPU-2 byte 1, GPU-1 byte 2, …) so that they are served equally? Or does one GPU block the other for the duration of its own PCIe transfer? I mean, is PCIe entirely serial?

Robert_Crovella · March 30, 2018, 2:48pm

I haven’t thought through every possible UM regime. It may not work on some regimes. I should revise my previous response.

The one regime I had in mind was a CUDA 9/9.1 Linux regime, where the GPUs are on the same fabric.

In that case, my expectation is that when GPU 0 attempted to touch, say, a[0], it would demand-page that page to GPU 0. When GPU 1 attempted to touch, say, a[1], it would demand-page the same page from GPU 0 to GPU 1, effectively invalidating local access to the page by GPU 0. Any changes made by GPU 0 should be flushed before the page is sent to GPU 1 (I believe). Even if the UM system created a mapping from GPU 0 to GPU 1 for this page, over NVLink 2.0 (Volta) there is coherency in this scenario.

However, in a multi-GPU scenario in other UM regimes, the allocation will fall back to becoming a host (pinned) allocation. In that case, I’m not sure about the behavior. AFAIK system-memory transactions are not cached from the GPU perspective (i.e. they immediately turn into PCIe bus transactions), but if the R/W transactions were close enough in time there may still be a race condition. I haven’t investigated it thoroughly. So I’ve revised my previous response.

BulatZiganshin · March 30, 2018, 5:23pm

Oh, I completely missed the mention of UM in the original question, and thought it was just about pinned host memory, which is accessed with a transaction over PCIe for each operation.

In the case of UM, the answer is that it doesn’t work. From “Unified Memory for CUDA Beginners” (NVIDIA Technical Blog):

On pre-Pascal GPUs, upon launching a kernel, the CUDA runtime must migrate all pages previously migrated to host memory or to another GPU back to the device memory of the device running the kernel. Since these older GPUs can’t page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won’t). This means there is potentially migration overhead on each kernel launch.

Pascal GPUs such as the NVIDIA Titan X and the NVIDIA Tesla P100 are the first GPUs to include the Page Migration Engine, which is hardware support for Unified Memory page faulting and migration.

Unlike the pre-Pascal GPUs, the Tesla P100 supports hardware page faulting and migration. So in this case the runtime doesn’t automatically copy all the pages back to the GPU before running the kernel. The kernel launches without any migration overhead, and when it accesses any absent pages, the GPU stalls execution of the accessing threads, and the Page Migration Engine migrates the pages to the device before resuming the threads.

Simultaneous access to managed memory from the CPU and GPUs of compute capability lower than 6.0 is not possible. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault.

If you have multiple GPUs, the rules are the same: each GPU needs to import all the data before a kernel can be started on that GPU.
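A quick way to see which regime a given machine falls into is to query the relevant device attributes. The following is only a minimal sketch using the CUDA runtime API; the helper name `supportsConcurrentManagedAccess` is made up for this illustration, and the attribute is expected (not guaranteed by this thread) to report 0 on pre-Pascal devices such as Kepler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (name invented for this sketch): returns true if the
// device reports that CPU and GPU may access managed memory concurrently.
bool supportsConcurrentManagedAccess(int dev)
{
    int concurrent = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    return err == cudaSuccess && concurrent != 0;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA devices found\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        int major = 0, managed = 0;
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);
        cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
        std::printf("device %d: cc major %d, managed memory %d, concurrent managed access %d\n",
                    dev, major, managed, supportsConcurrentManagedAccess(dev) ? 1 : 0);
    }
    return 0;
}
```

If the last column is 0 for any GPU involved, the on-demand page migration discussed in this thread should not be assumed to be available on that device.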

For page-level migration, you need a Pascal/Volta GPU and Linux x64; it’s not supported for other combinations. And even in this case, random access to this array will be extremely inefficient, since for 50% of the writes the page is on the wrong side and you have to move the entire 4 KB page over the bus.
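As a rough back-of-the-envelope illustration of that cost, take the 50 MB array of 8-byte elements from the original question, and assume (purely for this estimate) one write per element, 50 MB = 50 × 2^20 bytes, and that half of the writes trigger a 4 KB page migration. Real traffic depends on access order and driver behavior, so this is only an order-of-magnitude sketch:

```latex
\begin{align*}
N_{\text{writes}} &= \frac{50 \times 2^{20}\ \text{B}}{8\ \text{B}} = 6{,}553{,}600 \\
N_{\text{migrations}} &\approx 0.5 \times N_{\text{writes}} = 3{,}276{,}800 \\
\text{traffic} &\approx N_{\text{migrations}} \times 4096\ \text{B}
  = 13{,}421{,}772{,}800\ \text{B} \approx 12.5\ \text{GiB}
\end{align*}
```

Under these assumptions the bus would carry about 256 times the size of the array itself.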

You may have better chances with old-fashioned host-only pinned memory allocation. In this case, each write is sent over PCIe, but I’m not sure about the coherency and efficiency of such an approach. It may be better just to create one array per GPU and then merge them.
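The “one array per GPU, then merge” idea could look roughly like the sketch below. It is an illustration under assumptions, not a tuned implementation: the kernel `scatterWrite`, the per-element `written` mask, the helper `mergeMasked`, and the `i % numGpus` ownership rule are all hypothetical stand-ins for whatever non-overlapping write pattern the real kernels use.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) {        \
    std::fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_));      \
    std::exit(1); } } while (0)

// Hypothetical kernel: GPU `gpu` writes only indices with i % numGpus == gpu
// into its own private buffer and records which elements it wrote.
__global__ void scatterWrite(double* out, unsigned char* written,
                             size_t n, int gpu, int numGpus)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n && i % (size_t)numGpus == (size_t)gpu) {
        out[i] = (double)i;
        written[i] = 1;
    }
}

// Host-side merge: copy only the elements that this GPU actually wrote.
void mergeMasked(std::vector<double>& result,
                 const std::vector<double>& partial,
                 const std::vector<unsigned char>& written)
{
    for (size_t i = 0; i < result.size(); ++i)
        if (written[i]) result[i] = partial[i];
}

int main()
{
    int numGpus = 0;
    if (cudaGetDeviceCount(&numGpus) != cudaSuccess || numGpus == 0) {
        std::printf("No CUDA devices found\n");
        return 0;
    }
    const size_t n = 1 << 20;
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    std::vector<double*> dOut(numGpus);
    std::vector<unsigned char*> dWritten(numGpus);
    std::vector<cudaStream_t> streams(numGpus);

    // Launch on every GPU without waiting, so the kernels can overlap.
    for (int g = 0; g < numGpus; ++g) {
        CHECK(cudaSetDevice(g));
        CHECK(cudaStreamCreate(&streams[g]));
        CHECK(cudaMalloc(&dOut[g], n * sizeof(double)));
        CHECK(cudaMalloc(&dWritten[g], n));
        CHECK(cudaMemsetAsync(dWritten[g], 0, n, streams[g]));
        scatterWrite<<<blocks, threads, 0, streams[g]>>>(dOut[g], dWritten[g], n, g, numGpus);
        CHECK(cudaGetLastError());
    }

    // Collect each GPU's private buffer and merge it into the host result.
    std::vector<double> result(n, 0.0), partial(n);
    std::vector<unsigned char> written(n);
    for (int g = 0; g < numGpus; ++g) {
        CHECK(cudaSetDevice(g));
        CHECK(cudaStreamSynchronize(streams[g]));
        CHECK(cudaMemcpy(partial.data(), dOut[g], n * sizeof(double), cudaMemcpyDeviceToHost));
        CHECK(cudaMemcpy(written.data(), dWritten[g], n, cudaMemcpyDeviceToHost));
        mergeMasked(result, partial, written);
        CHECK(cudaFree(dOut[g]));
        CHECK(cudaFree(dWritten[g]));
        CHECK(cudaStreamDestroy(streams[g]));
    }

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (result[i] != (double)i) ++bad;
    std::printf("%zu mismatches using %d GPU(s)\n", bad, numGpus);
    return bad == 0 ? 0 : 1;
}
```

This sidesteps the coherency question entirely, since no two GPUs ever touch the same allocation, at the cost of one extra buffer and mask per GPU plus a host-side merge pass.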

Robert_Crovella · March 30, 2018, 5:28pm

Good point, I was thinking demand-paging, which is not available on Kepler GPUs. So my answer was completely off-base.

Programming Guide :: CUDA Toolkit Documentation