To my delight, I recently discovered that UMA finally has reasonable performance on Windows if you have a PCIe 4.0 system. I don’t have one, but my project sponsor does.
I have a notebook with a GTX 1660 Ti with 6GB of memory; its Windows 10 host has 16GB of memory and an 8GB page file. The largest UMA array I can allocate there is about 5GB. Another system has two GTX 1080 Tis with 11GB of memory each; its Windows 10 host has 32GB of memory and a 32GB page file. On this system I can allocate a UMA array of 15GB. Both systems use CUDA version 11.
I am confused about (1) why I can’t oversubscribe on the 1660, and (2) why I can oversubscribe on the 1080, but only to 15GB.
Does anybody know what factor(s) limit the maximum UMA allocation size?
Update: If I hide one of the 1080s, it becomes clear that oversubscription does NOT work on that system either. It appears that although the subset of UMA functionality available on Windows is faster with PCIe 4.0, you still can’t use oversubscription because they still haven’t implemented page faulting.
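
For anyone who wants to reproduce these numbers, here is a minimal sketch of how the limit could be probed. It is not the exact code I used. The helper names `try_managed` and `largest_ok`, the 4x upper bound, and the 64 MiB resolution are all arbitrary choices of mine. It assumes allocation success is monotonic in size, which may not hold exactly. As far as I understand it, a `cudaDevAttrConcurrentManagedAccess` value of 0 means the device cannot page-fault on managed memory, which would be consistent with oversubscription not working.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Returns true if a managed allocation of `bytes` succeeds. The memory is
// freed right away; it is never touched, so this only tests the allocation call.
static bool try_managed(size_t bytes) {
    void* p = nullptr;
    if (cudaMallocManaged(&p, bytes) != cudaSuccess) {
        cudaGetLastError();  // reset the error state before the next attempt
        return false;
    }
    cudaFree(p);
    return true;
}

// Bisects for the largest size in [0, hi] accepted by `ok`, to within `step`.
// Assumes that if a size fails, every larger size fails too.
template <typename Pred>
size_t largest_ok(size_t hi, size_t step, Pred ok) {
    if (ok(hi)) return hi;
    size_t lo = 0;  // treated as trivially OK
    while (hi - lo > step) {
        size_t mid = lo + (hi - lo) / 2;
        if (ok(mid)) lo = mid; else hi = mid;
    }
    return lo;
}

int main() {
    const int dev = 0;  // assumption: probe the first visible device
    if (cudaSetDevice(dev) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("concurrentManagedAccess: %d\n", concurrent);
    std::printf("device memory: %.2f GiB free of %.2f GiB\n",
                free_b / gib, total_b / gib);

    // Search up to 4x the device memory, at 64 MiB resolution.
    size_t max_b = largest_ok(4 * total_b, size_t(64) << 20, try_managed);
    std::printf("largest managed allocation: about %.2f GiB\n", max_b / gib);
    return 0;
}
```

Running this once with all GPUs visible and once with `CUDA_VISIBLE_DEVICES` restricted to a single GPU should show whether the larger figure on the dual-1080 system really depends on the second card.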