cudaMallocManaged allocates more memory than requested

Here is my full code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>

class Managed {
public:
	void *operator new(size_t len) {
		void *ptr;
		cudaMallocManaged(&ptr, len);
		return ptr;
	}
	void operator delete(void *ptr) {
		cudaFree(ptr);
	}
};

class Encounter : public Managed {
public:
	int ID; // 4 bytes
	int Age; // 4 bytes
	char Gender; // 1 byte	
	Encounter(const int id, const int age, const char gender) {		
		ID = id;
		Age = age;
		Gender = gender;		
	}
	~Encounter() { }
};

__global__ void Launch(Encounter ** e) { }

int main()
{
	const int caseCount = 100;	
	std::cout << sizeof(char) << std::endl;
	cudaError_t cudaStatus = cudaSetDevice(0);
	size_t mf, ma;
	cudaMemGetInfo(&mf, &ma);

	Encounter ** encounters;
	cudaMallocManaged(&encounters, sizeof(Encounter*) * caseCount);
	size_t lastFree = mf;
	for (int i = 0; i < caseCount; i++) {
		Encounter * e = new Encounter(i, 18, 'F');
		encounters[i] = e;		

		cudaMemGetInfo(&mf, &ma);
		std::cout << "Free:" << mf << " Total:" << ma << " Used:" << (lastFree - mf) << std::endl;
		lastFree = mf;
	}	
	Launch<<<1, 1>>>(encounters);
	cudaDeviceSynchronize();
	
	cudaDeviceReset();
	return 0;
}

Let me explain:
I suppose the Encounter object size should be 9 bytes (int + int + char), but due to the 512-byte alignment on x64, the real allocated size will be 512.
So an array of 100 objects will take: 512 * 100 = 51,200 bytes
Plus the cudaMallocManaged call for the pointer array: 8 * 100 = 800 bytes => real allocated size = 1,024 bytes

So the total real allocated memory should be: 1,024 + 51,200 = 52,224 bytes

But here is my test output:

Free:1723825766 Total:2147483648 Used:2097152
Free:1723825766 Total:2147483648 Used:0
(line above repeated 30 times in total)
Free:1721728614 Total:2147483648 Used:2097152
Free:1721728614 Total:2147483648 Used:0
(line above repeated 31 times in total)
Free:1719631462 Total:2147483648 Used:2097152
Free:1719631462 Total:2147483648 Used:0
(line above repeated 31 times in total)
Free:1717534310 Total:2147483648 Used:2097152
Free:1717534310 Total:2147483648 Used:0
(line above repeated 4 times in total)

I know that 2,097,152 is a cache/page size, but since my total request (52,224 bytes) does not exceed that value, I would expect only a single 2 MB block to be used.

Could you please help me find a calculation that explains the values in my test output?

I tested on:
Windows 10 64-bit
Nvidia GeForce GTX 1050

Thank you so much!
VyPhan

I’m not sure what your goal is here.

The alignment value of 512 is not relevant to the size of the allocation. The allocation has a granularity (page size) and the underlying memory manager may implement a pool allocator, which means that not every allocation of a particular type will reduce the reported memory by the same value.

You have expectations which are not valid.

Hi Robert,

I understand what you mean. The problem I want explored is why the requested and actually allocated amounts differ so much: 52,224 bytes requested vs. 2,097,152 * 4 bytes (four pools) consumed. You can try my full example code above.

Thanks for your response!

For efficiency, the first time you make a managed allocation request, the pool allocator pulls 2 MB of memory out of the unallocated heap and prepares it for allocation in pages. Each page is no less than 4 KB (this may vary by OS, platform, CUDA version, GPU, driver version, driver model, etc.), but in this case is actually 2 MB / 32 = 64 KB. Even if your single allocation request is only for 9 bytes, that allocation will take at least one page, i.e. at least 64 KB out of the pool. So after 32 of these requests, another pool allocation must be done, subtracting another 2 MB from the unallocated heap, and the process repeats. Each allocation must use a set of pages that are exclusively used/reserved for it; if you think carefully about this, you will see why: two different allocations cannot share the same underlying pages.

So on your platform/setup, a single allocation uses a minimum of 64 KB.

Thanks Robert, that makes sense.

How can I get the minimum single-allocation size (64 KB in my case), and is there any way to reduce that value?
This only happens with cudaMallocManaged, not cudaMalloc.

I’m not aware of any direct programmatic way to get the allocation size (except the inferential method you have already created).

I’m not aware of any controls the programmer has over this allocation granularity.

As an aside:

Your GPU is in WDDM driver mode.

I suspect some of this is influenced by WDDM driver model, controlled (at least at the API/behavioral level) by Microsoft.

For a GPU in WDDM mode, the GPU memory is managed by the Windows WDDM driver, not directly by any NVIDIA code. So CUDA must make a request to the Windows OS for the memory it wishes to use.

well noted,

Thanks so much, Robert.

You’re not the first to run into this:

https://devtalk.nvidia.com/default/topic/1037440/cuda-programming-and-performance/cudamallocmanaged-allocating-more-memory-than-requested/

https://stackoverflow.com/questions/36778089/does-cudamallocmanaged-allocate-memory-on-the-device

https://devtalk.nvidia.com/default/topic/974224/cuda-programming-and-performance/multidimensional-array-allocation-with-cuda-unified-memory-/