Item 2 won’t work. There is no way to guarantee that the addresses returned by four separate allocations are contiguous, and no way to force them to be. You cannot take 4 pointers and make them behave as 1. For example, self-referential indices or pointers stored in the data would immediately break unless you invested a lot of additional coding effort.
For item 1, you can use hints, in particular memory range-based hints using cudaMemAdvise with cudaMemAdviseSetPreferredLocation, to create four 20GB ranges out of your 80GB buffer and advise the preferred location for each of those chunks, one per GPU. You can read the documentation to get an idea of the implications and corner cases. You might also want to do cudaMemPrefetchAsync on each section, to “push” it to each GPU.
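Something along these lines might work (a minimal, untested sketch; the buffer name `buf`, the size `N`, the GPU count, and the omission of error checking are all assumptions on my part):

```cpp
#include <cuda_runtime.h>

int main() {
  const int    numGPUs = 4;
  const size_t N       = 80ULL * 1024 * 1024 * 1024;  // 80GB managed buffer
  const size_t chunk   = N / numGPUs;                  // 20GB per GPU

  char *buf;
  cudaMallocManaged(&buf, N);        // one allocation, visible to all GPUs

  for (int dev = 0; dev < numGPUs; dev++) {
    // advise the driver to keep this 20GB range resident on GPU `dev`
    cudaMemAdvise(buf + dev * chunk, chunk,
                  cudaMemAdviseSetPreferredLocation, dev);
    // optionally "push" the range to its preferred GPU up front
    cudaMemPrefetchAsync(buf + dev * chunk, chunk, dev, 0 /* default stream */);
  }
  cudaDeviceSynchronize();

  // ... launch kernels on each GPU against (mostly) its own chunk ...

  cudaFree(buf);
  return 0;
}
```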
For case 1 and the follow-on comments I have just made, there is an assumption that your partitioning of the data into four 20GB chunks, with the intent to locate one chunk on each GPU, has some basis in code behavior. For example: “I have partitioned my code (the kernels I launch) so that kernels launched on GPU A mostly use the 20GB of data assigned to GPU A and only occasionally access the data on GPUs B, C, and D.”

If instead your access patterns are truly “random”, then the memory advising probably doesn’t make sense, and this essentially becomes a performance benchmarking exercise. In the truly random case there is no winning strategy, because most strategies depend on some (non-trivial) knowledge of your data access patterns. If your access patterns are sufficiently random, you might not do better than case 3, or than case 1 with just a straightforward cudaMallocManaged allocation and no further coding effort. One of those two could serve as the performance baseline against which you judge any other strategy.
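For reference, the “no further coding effort” baseline for case 1 would be just the plain managed allocation, letting pages migrate on demand as kernels touch them. A sketch, again with placeholder names (`myKernel`, the grid dimensions, and the 4-GPU loop are illustrative assumptions):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(char *data, size_t n) {
  // placeholder kernel: touch a portion of the managed buffer
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

int main() {
  const size_t N = 80ULL * 1024 * 1024 * 1024;  // 80GB
  char *buf;
  cudaMallocManaged(&buf, N);   // single allocation, no advice, no prefetch

  for (int dev = 0; dev < 4; dev++) {
    cudaSetDevice(dev);
    // every GPU can dereference `buf` directly; pages migrate on demand
    myKernel<<<1024, 256>>>(buf, N);
  }
  for (int dev = 0; dev < 4; dev++) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
  }
  cudaFree(buf);
  return 0;
}
```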