I’m quite new to CUDA C programming, so my question may seem odd, but I cannot find the answer anywhere.
I’ve written a bunch of kernels, plus a wrapper so I can call them from C#. While analyzing the performance (all kernels and other methods like cudaMalloc are called through this wrapper), I noticed that even when only cudaMalloc is called, host RAM usage goes up. The private bytes of the process on the host increase by the same amount of memory allocated on the device. So when I call cudaMalloc to allocate device memory for a 4k x 4k floating-point matrix, the memory usage on the host increases by the same amount (checked with dotMemory and Process Explorer as well).
But that shouldn’t happen, right? cudaMalloc allocates on the device, not on the host; the host only gets a 64-bit pointer.
I asked the guys at dotMemory if they can observe device memory as well (that would explain why I see increased unmanaged memory usage), and I got the following answer:
“The ‘Unmanaged memory’ number is just the difference between ‘Total used’ and ‘Total .NET used’.
It seems that allocating with cudaMalloc increases the committed private bytes of the process; please use Task Manager or Process Explorer to check this.”
Could someone please enlighten me on this?
Thanks in advance!
Does the amount of host memory used really change with the size of the device allocation?
The first call into the CUDA runtime API will initialize it and, amongst other things, create a device context on the host that holds various information about the status of the device.
If you insert a harmless cudaFree(0); before the call to cudaMalloc(), you should find that the host memory is already taken up after the cudaFree(0), and that the subsequent cudaMalloc() adds little or nothing on top of that.
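A minimal sketch of this experiment (buffer size and names are illustrative, matching the 4k x 4k float matrix from the question):

```cuda
// Sketch: separate CUDA context-creation cost from the cost of the allocation itself.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Harmless call whose only effect here is to initialize the CUDA runtime
    // and create the device context. Check host memory use after this point;
    // the context accounts for a fixed, size-independent chunk of it.
    cudaFree(0);

    float *d_buf = NULL;
    size_t bytes = 4096ull * 4096ull * sizeof(float); // ~64 MB
    cudaError_t err = cudaMalloc((void **)&d_buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Host memory measured here, minus the post-cudaFree(0) baseline,
    // is what cudaMalloc itself adds on the host side.
    cudaFree(d_buf);
    return 0;
}
```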
I am calling cudaMalloc in a for loop (with Thread.Sleep() in every cycle, for a more readable memory-allocation chart), and in every cycle the “Unmanaged memory” usage increases. If I comment out the cudaMalloc call, the unmanaged memory usage on the host stays constant.
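For reference, the loop described above might look roughly like this as plain CUDA C on Windows (iteration count and sleep interval are illustrative; the original code was a C# wrapper):

```cuda
// Sketch of the allocation loop that reproduces the observation:
// host private bytes grow with each device allocation.
#include <cuda_runtime.h>
#include <stdio.h>
#include <windows.h> // Sleep(); the thread's platform is Windows

int main(void)
{
    cudaFree(0); // create the context up front so it doesn't skew the first iteration

    for (int i = 0; i < 20; ++i) {
        float *d_mat = NULL;
        size_t bytes = 4096ull * 4096ull * sizeof(float); // 4k x 4k float matrix
        cudaError_t err = cudaMalloc((void **)&d_mat, bytes);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            break;
        }
        Sleep(1000); // pause so each allocation step shows up on the memory chart
    }
    return 0;
}
```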
Btw, I am using CUDA Toolkit 8.
I assume your platform is Windows? If so, I think what you are seeing is Windows reserving backing store for the device allocations, a normal side effect of using the default WDDM driver, which allows video memory to be paged out to system memory. Admittedly I have never looked into the details of WDDM memory virtualization, so I am not sure.
@njuffa: now that would be a shame, since it’s a common scenario that a matrix (or anything else) is only needed on device.
But to be sure, I will try to find out how WDDM handles this situation. Thanks.
If anyone else has any idea about this, please don’t keep it for yourself :)
While WDDM is not known for its efficiency, I don’t think the reservation of virtual memory space should be a problem as there is plenty of that to go around. So I don’t think there is reason for concern here.
FWIW, I have had no luck trying to find official documentation on the internal details of WDDM memory management. Maybe I have been looking in the wrong places.
Thank you for your effort looking for WDDM memory management documentation, sorry to hear that you couldn’t find anything useful.
It bugs me that the operating system (or WDDM, or anything else) reserves memory on the host when I didn’t ask for it, so I’ll keep looking for an answer. I’ll be in touch when I find one.
In WDDM mode, the memory of the GPU is not actually managed directly by the GPU driver (except at the lowest level). The GPU memory is managed by WDDM, and operations like cudaMalloc cause the CUDA driver to issue WDDM requests to allocate GPU memory (which are then serviced by WDDM talking to the GPU driver…).
WDDM can oversubscribe GPU memory. This means it has the capability to have certain items that would normally reside in GPU memory “swapped out” to system memory, and this applies to CUDA allocations made via cudaMalloc. The net effect is that WDDM may have more items instantiated that would normally be in GPU memory than the GPU memory footprint can support all at once, and it may swap items in and out of GPU memory as needed by various operations (e.g. graphics, compute, etc.).
An implication of this capability, then, is that the WDDM system may increase its memory usage as you perform certain operations on a GPU.
I personally have not witnessed a 1:1 allocation of host memory alongside device memory when doing allocations on a WDDM GPU, but I suspect it is possible in some circumstances. It may be that you need a WDDM 2.0 GPU to witness this. I’m also not certain this explains what you are seeing, as I don’t know exactly what you are seeing, and I almost never work in C#.
You can set up a test case to witness WDDM GPU memory paging. It starts with creating an allocation for most or all of the “available” GPU memory in CUDA, then running a kernel which touches this in a loop. As WDDM pages back and forth between graphics and compute contexts, you will see memory/data being moved between GPU and CPU. It’s been a while since I’ve done this, so that may not be a complete recipe. You may need to have some non-trivial graphics app running as well.
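A sketch of that test case, under the assumptions stated above (grab most of the reported free memory, then repeatedly touch it from a kernel; the 90% headroom and iteration count are arbitrary choices, and a concurrent graphics workload may be needed to force paging):

```cuda
// Sketch: pressure GPU memory so WDDM may page the allocation to system memory.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void touch(float *p, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        p[i] += 1.0f; // read-modify-write so the pages must be resident on the GPU
}

int main(void)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    // Allocate most of the reported free memory; leave headroom for the context.
    size_t n = (size_t)(freeB * 0.9) / sizeof(float);

    float *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // Touch the whole allocation repeatedly; with graphics contending for VRAM,
    // WDDM may move parts of it between GPU and system memory across launches.
    for (int iter = 0; iter < 100; ++iter) {
        touch<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n);
        cudaDeviceSynchronize();
    }
    cudaFree(d_buf);
    return 0;
}
```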
This may be of interest:
@txbob: Thank you for your explanation, it was a great help. I’ll try this memory paging test.