In general, documentation for managed memory behavior is in the CUDA C++ Programming Guide, section L. Not sure how to find the CUDA C++ Programming Guide? Go to docs.nvidia.com and click on the “CUDA Toolkit Documentation” link; hopefully you can navigate from there.
In your original question, your allocation size appeared to be smaller than the memory size of your GPU. Therefore it’s possible the explanation I had given you was approximately correct (although I believe I made a mistake - see below).
Managed memory on Windows follows the pre-Pascal regime/behavioral description. Even so, there are at least two behavioral possibilities in a multi-GPU scenario.
First, the system topology may permit both GPUs to be in a peer relationship. In that case, the data will be migrated en masse, at kernel launch, to the GPU that needs it. If both GPUs need it, one GPU will be given a peer mapping to the data resident on the other GPU.
Second, if the system topology does not permit the GPUs to be in a peer relationship, then the allocation will instead be made as if it were a pinned host allocation, and both GPUs will be given a pointer to that host allocation. In this case no migration of data is necessary.
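If you want to confirm which category your system falls into, a minimal sketch like this (I’m assuming device IDs 0 and 1 here) will report whether the two GPUs can be placed in a peer relationship:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // If either direction reports 0, the GPUs cannot be placed in a peer
  // relationship, which would put you in the second category above.
  int can01 = 0, can10 = 0;
  cudaDeviceCanAccessPeer(&can01, 0, 1);
  cudaDeviceCanAccessPeer(&can10, 1, 0);
  printf("GPU0 -> GPU1 peer capable: %d\n", can01);
  printf("GPU1 -> GPU0 peer capable: %d\n", can10);
  return 0;
}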
Based on your initial description I assumed we were in the first category. However, in the first category, if you try to allocate more than the memory available on the GPU, you get an out-of-memory error from the cudaMallocManaged call - I just tested that on my Windows system and that is what I get.
Now that you are reporting that you can successfully allocate more memory than that, I would assume your GPUs cannot be placed into a peer relationship, which would put them into the second category. However, in that case the earlier statement about migration does not hold. Unfortunately, I don’t have a system exactly like that to test on.
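If it’s convenient, you could also probe the allocation behavior directly with something like this (the 1.5x request size is just an arbitrary oversubscription factor for illustration):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t free_mem = 0, total_mem = 0;
  cudaMemGetInfo(&free_mem, &total_mem);        // memory size of the current GPU
  size_t request = total_mem + total_mem / 2;   // deliberately larger than the GPU
  char *p = nullptr;
  cudaError_t err = cudaMallocManaged(&p, request);
  printf("requested %zu bytes: %s\n", request, cudaGetErrorString(err));
  if (err == cudaSuccess) cudaFree(p);
  return 0;
}

On a system in the first category I would expect that to print an out-of-memory error; based on your report, I’d expect your system to print “no error”.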
I would also like to amend a previous remark that I believe was in error. In the first category, data migration occurs en masse to the GPU at the point of kernel launch. However, I misspoke about the return case. After a subsequent cudaDeviceSynchronize(), it is true that the data can now be accessed again in host memory, but I’m not certain there is an en-masse data copy back. I think the copy in that direction can still be demand-paged. The net effect would be that the forced transfers only happen in one direction, which would seem to align with your transfer rate calculation (but is inconsistent with the other data point).
Anyway I’m afraid I may have just confused matters.
Moving on to your question, I believe you can still use Unified Memory in the case you describe. As we have now discussed, cudaDeviceSynchronize() has the special characteristic in the pre-Pascal regime (i.e. the Windows regime) that it makes the data visible again on the host, and subsequent kernel launches will then trigger the transfer of the data back to the device.
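To make that concrete, the basic single-GPU pattern in this regime is something like the following sketch (kernelA, grid, block and N are placeholders for your own code); note that touching a from host code between the launch and the cudaDeviceSynchronize() would be illegal in this regime:

int *a;
cudaMallocManaged(&a, N * sizeof(int));  // a is host-accessible here, before any kernel runs
kernelA<<<grid, block>>>(a);             // launch migrates the data to the GPU en masse
cudaDeviceSynchronize();                 // after this, a is host-accessible again
// the next kernel launch will trigger the transfer back to the device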
But cudaDeviceSynchronize() is not the only way to enforce what you want. First of all if you only had 1 GPU, stream semantics would suffice. With 2 GPUs, you could try using cudaStreamSynchronize(), and if that doesn’t work I’m fairly confident that using events would.
The cudaStreamSynchronize() method would look like this:
cudaSetDevice(0);
kernelA<<<grid, block, 0, stream1>>>(a);      // work on a[0]
cudaSetDevice(1);
kernelA<<<grid, block, 0, stream2>>>(a+1);    // work on a[1]
cudaSetDevice(0);
cudaStreamSynchronize(stream1);   // kernelB needs the entire a, so both streams must be synced
cudaSetDevice(1);
cudaStreamSynchronize(stream2);
cudaSetDevice(0);
kernelB<<<grid, block>>>(a, x);
cudaSetDevice(1);
kernelB<<<grid, block>>>(a, x);
The event method would look like this:
cudaSetDevice(0);
kernelA<<<grid, block, 0, stream1>>>(a);      // work on a[0]
cudaEventRecord(evt0, stream1);
cudaSetDevice(1);
kernelA<<<grid, block, 0, stream2>>>(a+1);    // work on a[1]
cudaEventRecord(evt1, stream2);
cudaSetDevice(0);
cudaStreamWaitEvent(stream1, evt0);   // kernelB needs the entire a, so wait for both events
cudaStreamWaitEvent(stream1, evt1);
kernelB<<<grid, block, 0, stream1>>>(a, x);
cudaSetDevice(1);
cudaStreamWaitEvent(stream2, evt0);
cudaStreamWaitEvent(stream2, evt1);
kernelB<<<grid, block, 0, stream2>>>(a, x);
Just as your code is shorthand, I’m just trying to give shorthand ideas. These are not exact recipes.
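One detail the shorthand glosses over, in case it trips you up: a stream (or event) is created against whichever device is current at creation time, so in the event sketch the setup would look roughly like this (same placeholder names as above):

cudaStream_t stream1, stream2;
cudaEvent_t evt0, evt1;
cudaSetDevice(0);
cudaStreamCreate(&stream1);   // stream1 and evt0 belong to GPU 0
cudaEventCreate(&evt0);
cudaSetDevice(1);
cudaStreamCreate(&stream2);   // stream2 and evt1 belong to GPU 1
cudaEventCreate(&evt1);
// ... launches, event records and stream waits as sketched above ...

cudaStreamWaitEvent() is happy to make a stream wait on an event recorded on a different device, which is what makes the cross-GPU synchronization in the second sketch work.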