cudaMallocManaged allocates more memory than requested

Here is my full code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>

class Managed {
public:
	void *operator new(size_t len) {
		void *ptr;
		cudaMallocManaged(&ptr, len);
		return ptr;
	}
	void operator delete(void *ptr) {
		cudaFree(ptr);
	}
};

class Encounter : public Managed {
public:
	int ID; // 4 bytes
	int Age; // 4 bytes
	char Gender; // 1 byte	
	Encounter(const int id, const int age, const char gender) {		
		ID = id;
		Age = age;
		Gender = gender;		
	}
	~Encounter() { }
};

__global__ void Launch(Encounter ** e) { }

int main()
{
	const int caseCount = 100;	
	std::cout << sizeof(char) << std::endl;
	cudaError_t cudaStatus = cudaSetDevice(0);
	size_t mf, ma;
	cudaMemGetInfo(&mf, &ma);

	Encounter ** encounters;
	cudaMallocManaged(&encounters, sizeof(Encounter*) * caseCount);
	size_t lastFree = mf;
	for (int i = 0; i < caseCount; i++) {
		Encounter * e = new Encounter(i, 18, 'F');
		encounters[i] = e;		

		cudaMemGetInfo(&mf, &ma);
		std::cout << "Free:" << mf << " Total:" << ma << " Used:" << (lastFree - mf) << std::endl;
		lastFree = mf;
	}	
	Launch<<<1, 1>>>(encounters);
	cudaDeviceSynchronize();
	
	cudaDeviceReset();
	return 0;
}

Let me explain:
I suppose the Encounter object size should be 9 bytes (int + int + char), but due to the 512-byte alignment on x64, the real allocated size will be 512.
So an array of 100 objects will take: 512 * 100 = 51,200 bytes
Plus the cudaMallocManaged call for the pointer array: 8 * 100 = 800 bytes => real allocated size = 1,024 bytes

So the total real allocated memory should be: 1,024 + 51,200 = 52,224 bytes

But here is my test output:

Free:1723825766 Total:2147483648 Used:2097152
Free:1723825766 Total:2147483648 Used:0
(line above repeated 30 times in total)
Free:1721728614 Total:2147483648 Used:2097152
Free:1721728614 Total:2147483648 Used:0
(line above repeated 31 times in total)
Free:1719631462 Total:2147483648 Used:2097152
Free:1719631462 Total:2147483648 Used:0
(line above repeated 31 times in total)
Free:1717534310 Total:2147483648 Used:2097152
Free:1717534310 Total:2147483648 Used:0
(line above repeated 4 times in total)

I know that 2,097,152 is a cache/page size, but since my total request (52,224 bytes) does not exceed that value, I would expect only a single 2 MB block to be used.

Could you please help me find a calculation that explains the values in my test output?

I tested on:
Windows 10 64-bit
Nvidia GeForce GTX 1050

Thank you so much!
VyPhan

I’m not sure what your goal is here.

The alignment value of 512 is not relevant to the size of the allocation. The allocation has a granularity (page size) and the underlying memory manager may implement a pool allocator, which means that not every allocation of a particular type will reduce the reported memory by the same value.

You have expectations which are not valid.

Hi Robert,

I understand what you mean. The problem I want explored is why the requested and actually allocated amounts differ so much: 52,224 bytes requested vs. 2,097,152 * 4 bytes (four pools) consumed. You can try my full example code above.

Thanks for your response!

For efficiency, the first time you make a managed allocation request, the pool allocator pulls 2 MB of memory out of the unallocated heap and prepares it for allocation in pages. Each page is no less than 4 KB (this may vary by OS, platform, CUDA version, GPU, driver version, driver model, etc.), but in this case is actually 2 MB / 32 = 64 KB. Even if your single allocation request is only for 9 bytes, that allocation will take at least one page, i.e. at least 64 KB out of the pool. So after 32 of these requests, another pool allocation must be done, subtracting another 2 MB from the unallocated heap, and the process repeats. Each allocation must use a set of pages that are exclusively used/reserved for it; if you think carefully about this, you will see why: two different allocations cannot share the same underlying pages.

So on your platform/setup, a single allocation uses a minimum of 64 KB.

Thanks Robert, that makes sense.

How can I get the minimum single-allocation size (64 KB in my case), and is there any way to reduce that value?
This only happens with cudaMallocManaged, not cudaMalloc.

I’m not aware of any direct programmatic way to get the allocation size (except the inferential method you have already created).

I’m not aware of any controls the programmer has over this allocation granularity.

As an aside:

Your GPU is in WDDM driver mode.

I suspect some of this is influenced by WDDM driver model, controlled (at least at the API/behavioral level) by Microsoft.

For a GPU in WDDM mode, the GPU memory is managed by the Windows WDDM driver, not directly by any NVIDIA code. So CUDA must make a request to the Windows OS for the memory it wishes to use.

well noted,

Thanks so much, Robert.

You’re not the first to run into this:

https://devtalk.nvidia.com/default/topic/1037440/cuda-programming-and-performance/cudamallocmanaged-allocating-more-memory-than-requested/

https://stackoverflow.com/questions/36778089/does-cudamallocmanaged-allocate-memory-on-the-device

https://devtalk.nvidia.com/default/topic/974224/cuda-programming-and-performance/multidimensional-array-allocation-with-cuda-unified-memory-/