Does CUDA Unified Memory let multiple GPUs concurrently write to random, non-overlapping regions of a host array?

For example, two GPUs (Kepler, or cc 3.0+) launch kernels concurrently, and:

GPU-1 writes to a[0], a[155], a[1000] (never collides with indices 1, 200, 3, 5)

GPU-2 writes to a[1], a[200], a[3], a[5] (never collides with indices 0, 155, 1000)

After the streams are synchronized on the host (no memcpy, just Unified Memory), can we trust the data specifically on the “CPU” side, where reads will touch indices between 0 and 1000?

I don’t care whether a GPU sees the other GPUs’ writes. I’m asking only about what the CPU will see.

If there is no problem, what kind of performance degradation can be expected? For example, fully randomized writes (but again, no collisions on 8-byte-wide regions) to a 50 MB array, using 3 GPUs concurrently.

This may work under certain specific conditions (e.g. CUDA 9.1, Linux, P2P-capable GPUs). In other scenarios, it may not. For general, robust usage, it’s probably not a good design pattern.
See comments below.

I wonder how it works. a[0] and a[1] are in the same cache line, so does the GPU perform word-granular writes over the PCI-E bus, or does it somehow lock the entire memory line for the duration of a read-modify-write operation?

I expect a slowdown, of course. Maybe two lucky GPUs send data at the same time, so that their writes get interleaved in the same PCI-E stream and land in the same cache line, with maybe 0.001% probability? Can the data from both GPUs be interleaved (GPU-1 byte 1, GPU-2 byte 1, GPU-1 byte 2, …) so that they are served equally? Or does one GPU block the other for the duration of its own PCI-E transfer? I mean, is PCI-E a totally serial thing?

I haven’t thought through every possible UM regime. It may not work in some regimes. I should revise my previous response.

The one regime I had in mind was a CUDA 9/9.1 Linux regime, where the GPUs are on the same fabric.

In that case, my expectation is that when GPU 0 attempts to touch, say, a[0], it demand-pages the containing page to GPU 0. When GPU 1 then attempts to touch, say, a[1], it demand-pages the same page from GPU 0 to GPU 1, effectively invalidating GPU 0’s local access to the page. Any changes made by GPU 0 should be flushed before the page is sent to GPU 1 (I believe). Even if the UM system created a mapping from GPU 0 to GPU 1 for this page, over NVLink 2.0 (Volta) there is coherency in this scenario.

However, in a multi-GPU scenario in other UM regimes, the allocation will fall back to a pinned host allocation. In that case, I’m not sure about the behavior. AFAIK, system-memory transactions are not cached from the GPU’s perspective (i.e. they immediately turn into PCI-E bus transactions), but if the read/write transactions are close enough in time there may still be a race condition. I haven’t investigated it thoroughly. So I’ve revised my previous response.
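To find out which of these regimes a given machine falls into, one can query the runtime. The following is only a sketch: `classifyRegime` and its labels are hypothetical names introduced here for illustration, and it assumes that `cudaDevAttrConcurrentManagedAccess` (demand paging) and `cudaDeviceCanAccessPeer` (P2P) are the properties that matter for the cases discussed above. Error checking is kept minimal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: maps the two queried properties to a label
// for the regimes discussed in this thread.
const char* classifyRegime(bool allConcurrentManaged, bool allPeerCapable) {
    if (allConcurrentManaged) return "demand-paged";
    if (allPeerCapable)       return "pre-Pascal style, P2P available";
    return "pre-Pascal style, no P2P";
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("No CUDA devices found\n");
        return 0;
    }
    bool allConcurrent = true, allPeer = true;
    for (int d = 0; d < n; ++d) {
        // 1 if the device can access managed memory concurrently with the CPU.
        int concurrent = 0;
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, d);
        allConcurrent = allConcurrent && (concurrent != 0);
        for (int p = 0; p < n; ++p) {
            if (p == d) continue;
            // 1 if device d can directly access memory on device p.
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, d, p);
            allPeer = allPeer && (canAccess != 0);
        }
        printf("device %d: concurrentManagedAccess=%d\n", d, concurrent);
    }
    printf("regime: %s\n", classifyRegime(allConcurrent, allPeer));
    return 0;
}
```

The output only tells you what the hardware and driver report. It does not by itself guarantee that concurrent writes from several GPUs to the same page are safe.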

Oh, I completely missed the mention of UM in the original question and thought it was just about pinned host memory, which is accessed with a transaction over PCI-E for each operation.

In the case of UM, the answer is that it doesn’t work: https://devblogs.nvidia.com/unified-memory-cuda-beginners/ :

If you have multiple GPUs, the rules are the same: each GPU needs to import all the data before a kernel can be started on that GPU.

For page-level migration, you need a Pascal/Volta GPU and Linux x64; it’s not supported for other combinations. And even in this case, random access to this array will be extremely inefficient, since for 50% of writes the page is on the wrong side and the entire 4 KB has to be moved over the bus.
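As a rough back-of-the-envelope check of that estimate, assume $N$ GPUs writing independently and uniformly at random, a 4 KB migration granularity, 8-byte writes, and 50 MB taken as $50 \cdot 2^{20}$ bytes. These are modeling assumptions, not measured values, and the driver may migrate in larger chunks or apply its own heuristics.

$$
\text{pages} = \frac{50 \cdot 2^{20}}{4096} = 12800, \qquad
P(\text{page on another GPU}) = \frac{N-1}{N} =
\begin{cases} 1/2 & N = 2 \\ 2/3 & N = 3 \end{cases}, \qquad
\frac{\text{bytes moved}}{\text{bytes written}} = \frac{4096}{8} = 512
$$

So under this model the 50% figure corresponds to two GPUs, it rises to about 67% with three, and every write that triggers a migration moves 512 times more data than it actually stores.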

You may have better chances with an old-fashioned host-only pinned memory allocation. In this case, each write is sent over PCI-E, but I’m not sure about the coherency and efficiency of such an approach. It may be better to just create one array per GPU and then merge them.
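The "one array per GPU, then merge" idea could look roughly like the sketch below. It is not a tuned implementation: `scatterWrite` and `mergeOwned` are hypothetical names, the index lists are the ones from the question, the written values are arbitrary tags, and the code assumes the host knows which indices each GPU owns. If fewer GPUs are present than writers, writers are mapped onto devices round-robin. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes one value into this GPU's private copy of the array.
__global__ void scatterWrite(double* a, const int* idx, int count, double tag) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < count) a[idx[k]] = tag + idx[k];
}

// Host-side merge: copy only the elements that this writer owns.
void mergeOwned(std::vector<double>& dst, const std::vector<double>& src,
                const std::vector<int>& owned) {
    for (int i : owned) dst[i] = src[i];
}

int main() {
    const int n = 1001;
    // Non-overlapping index sets, as in the question.
    std::vector<std::vector<int>> owned = {{0, 155, 1000}, {1, 200, 3, 5}};
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        printf("No CUDA devices found\n");
        return 0;
    }
    const int writers = (int)owned.size();
    std::vector<double*> dA(writers);
    std::vector<int*> dIdx(writers);
    std::vector<cudaStream_t> streams(writers);
    std::vector<std::vector<double>> staged(writers, std::vector<double>(n, 0.0));

    // Queue work for every writer before waiting on any of them.
    for (int g = 0; g < writers; ++g) {
        cudaSetDevice(g % deviceCount);
        int count = (int)owned[g].size();
        cudaStreamCreate(&streams[g]);
        cudaMalloc(&dA[g], n * sizeof(double));
        cudaMalloc(&dIdx[g], count * sizeof(int));
        cudaMemsetAsync(dA[g], 0, n * sizeof(double), streams[g]);
        cudaMemcpyAsync(dIdx[g], owned[g].data(), count * sizeof(int),
                        cudaMemcpyHostToDevice, streams[g]);
        scatterWrite<<<(count + 127) / 128, 128, 0, streams[g]>>>(
            dA[g], dIdx[g], count, 1000.0 * (g + 1));
        // Pageable host memory is used here, so this copy is not fully asynchronous.
        cudaMemcpyAsync(staged[g].data(), dA[g], n * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[g]);
    }

    // Wait for each writer, then merge its owned elements into the final array.
    std::vector<double> a(n, 0.0);
    for (int g = 0; g < writers; ++g) {
        cudaSetDevice(g % deviceCount);
        cudaStreamSynchronize(streams[g]);
        mergeOwned(a, staged[g], owned[g]);
        cudaFree(dA[g]);
        cudaFree(dIdx[g]);
        cudaStreamDestroy(streams[g]);
    }
    printf("a[0]=%.0f a[1]=%.0f a[155]=%.0f a[200]=%.0f\n", a[0], a[1], a[155], a[200]);
    return 0;
}
```

For a 50 MB array, copying the full private array back from every GPU costs 50 MB per GPU. If the writes are sparse, having each GPU return compact (index, value) pairs instead may be cheaper, at the price of a slightly more involved kernel.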

Good point. I was thinking of demand paging, which is not available on Kepler GPUs. So my answer was completely off-base.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd