On Windows with the WDDM driver model, cudaMalloc does not directly allocate memory on the GPU. Instead, it asks WDDM (the Windows Display Driver Model, a Microsoft API) to make the allocation.
The WDDM system actually manages GPU memory (with sysmem as a fallback) using something like a demand-paged virtual memory system. In a WDDM setup, the CUDA subsystem is just one of potentially several users of the GPU (the other big one being the Windows display system). This memory management is not directly under the control of the NVIDIA driver, and it has a few notable effects:
- oversubscription is possible. I’ve not personally witnessed oversubscription for a single client (e.g. CUDA) but I have witnessed oversubscription when there are multiple clients making memory requests
- paging (i.e. movement/relocation) of data is possible, between device memory and sysmem
These characteristics are not under the control of the NVIDIA driver. The WDDM system is allowed to manage memory as it sees fit, and it may choose to move data from device memory to sysmem at any time. However, my understanding is that when a CUDA kernel is running, any device memory allocations it needs will be moved to, and actually resident in, device memory for the CUDA client to use.
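As a rough illustration of the point above, here is a minimal sketch that prints what cudaMemGetInfo reports before and after a cudaMalloc, and again after a kernel has touched the allocation. The helper names (bytesToMiB, reportMemory, touch) and the 256 MiB default size are my own choices for the example, not anything from the CUDA toolkit. Under WDDM the numbers reflect what the WDDM memory manager reports, so a successful cudaMalloc should not be read as a guarantee about where the data currently resides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: convert a byte count to MiB for printing.
static double bytesToMiB(size_t bytes) {
    return (double)bytes / (1024.0 * 1024.0);
}

// Hypothetical helper: print free/total memory as reported by the runtime.
static bool reportMemory(const char* label) {
    size_t freeB = 0, totalB = 0;
    cudaError_t err = cudaMemGetInfo(&freeB, &totalB);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("%-18s free = %.1f MiB, total = %.1f MiB\n",
           label, bytesToMiB(freeB), bytesToMiB(totalB));
    return true;
}

// Writes every byte so the allocation is actually used by a kernel.
__global__ void touch(unsigned char* p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) p[i] = 1;
}

int main(int argc, char** argv) {
    // Allocation size in MiB; 256 is an arbitrary default for this sketch.
    size_t mib = (argc > 1) ? strtoull(argv[1], nullptr, 10) : 256;
    size_t bytes = mib << 20;

    if (!reportMemory("before cudaMalloc")) return 1;

    unsigned char* d = nullptr;
    cudaError_t err = cudaMalloc(&d, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc(%zu MiB) failed: %s\n",
                mib, cudaGetErrorString(err));
        return 1;
    }
    reportMemory("after cudaMalloc");

    touch<<<256, 256>>>(d, bytes);
    err = cudaGetLastError();                          // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d);
        return 1;
    }
    reportMemory("after kernel");

    cudaFree(d);
    return 0;
}
```

Running it with a size argument (in MiB) while other GPU clients are active is one way to look for the multi-client effect described above, although what you actually see will depend on the GPU, the driver, and the Windows version.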
As you point out, cudaMallocManaged behaves somewhat differently. On Windows its behavior is not demand-paged, but the data does migrate at certain points.
This movement of data is not under the control of the NVIDIA driver. It is expected functionality, not a bug. There would be no reason to return an error such as out-of-memory; in fact, the NVIDIA driver has no knowledge of whether the data will be moved to and resident in sysmem, so it would have no basis for reporting such an error anyway.
If you don’t like this behavior, for certain kinds of NVIDIA GPUs you can select an alternate driver model, e.g. TCC, which takes the WDDM subsystem out of the picture. (This removes the possibility of using that GPU for display purposes.) Another option might be to switch to Linux.
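To see which situation a given GPU is in, a short sketch like the following queries the cudaDevAttrTccDriver attribute for each device. The driverModelName helper and its label strings are my own for this example. The attribute only says whether the device is using the TCC driver, so a value of 0 means WDDM on Windows but is also what I would expect on a non-Windows system.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: map the TCC attribute value to a readable label.
static const char* driverModelName(int tccFlag) {
    return tccFlag ? "TCC" : "not TCC (WDDM on Windows)";
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        // 1 if the device is using the TCC driver, 0 otherwise.
        int tcc = 0;
        err = cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: attribute query failed: %s\n",
                    dev, cudaGetErrorString(err));
            continue;
        }

        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        const char* name = (err == cudaSuccess) ? prop.name : "(unknown)";

        printf("device %d: %s, driver model: %s\n", dev, name, driverModelName(tcc));
    }
    return 0;
}
```

On GPUs that support it, the driver model is normally switched with the -dm option of nvidia-smi from an administrator prompt, followed by a reboot. Check the nvidia-smi documentation for your driver version before relying on that.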
This WDDM behavior has been in place for a long time (at least 5 years). Observing the basic data movement effect does not depend on any new development in recent NVIDIA drivers.