cudaMallocManaged allocating more memory than requested

Hello,
I was testing some code that uses the cudaMallocManaged function to allocate data structures accessible from both device and host, and I noticed that it appears to consume almost 100 times the requested amount of memory. I thought this happened because the cudaMalloc functions return 512-byte aligned memory blocks on my GPU, so I ran some experiments. It turns out that while cudaMalloc consumes roughly the requested amount of memory, cudaMallocManaged does not: it consumes far more. I attach my testing code:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "device_functions.h"

void* gpuMalloc(unsigned size) {
	static unsigned long long memory = 0;
	static unsigned long long memoryPred = 0;
	static void *previous = NULL;
	void *p;

	if (cudaMalloc(&p, size) == cudaSuccess) {

		if (previous != NULL)
			memory += (char *)p - (char *)previous;	/* byte distance from the previous allocation, used as a rough usage estimate */

		/* predicted usage, assuming requests are rounded up to 512-byte blocks */
		memoryPred += ((size + 511) / 512) * 512;

		printf("cudaMalloc:\n%fGB predicted; %fGB actually allocated\n", (float)memoryPred / (1024 * 1024 * 1024), (float)memory / (1024 * 1024 * 1024));
		previous = p;
		return p;
	}
	return NULL;
}

void* gpuMallocShared(unsigned size) {
	static unsigned long long memory = 0;
	static unsigned long long memoryPred = 0;
	static void *previous = NULL;
	void *p;

	if (cudaMallocManaged(&p, size, cudaMemAttachHost) == cudaSuccess) {

		if (previous != NULL)
			memory += (char *)p - (char *)previous;	/* byte distance from the previous allocation, used as a rough usage estimate */

		/* predicted usage, assuming requests are rounded up to 512-byte blocks */
		memoryPred += ((size + 511) / 512) * 512;

		printf("cudaMallocManaged:\n%fGB predicted; %fGB actually allocated\n", (float)memoryPred / (1024 * 1024 * 1024), (float)memory / (1024 * 1024 * 1024));
		previous = p;
		return p;
	}

	return NULL;
}

int main() {	
	void *ptr;

	while (1) {
		ptr = gpuMallocShared(sizeof(unsigned char));
		ptr = gpuMalloc(sizeof(unsigned char));
		getchar();
	}

	return 0;
}

I’ve tested this code on my laptop:
Windows 10 64-bit
Intel Core i7-7700HQ
16GB RAM
Nvidia GeForce GTX 1060 6GB
CUDA 9.1
Shader Model 3.5, Compute Capability 3.5, -rdc=true --machine 64

It would be useful to understand why this happens.
Thanks,
Francesco

I would think that cudaMallocManaged() is allocating memory pages corresponding to your operating system’s page size. This is because pages of managed memory need to be swapped between CPU and GPU on demand.

If you need to make more fine-grained allocations, consider putting your own memory pooling logic on top of the memory returned by cudaMallocManaged().
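For illustration, here is a minimal sketch of what such a pool could look like (the names poolInit/poolAlloc and the 512-byte slice granularity are my own choices, not a CUDA API): it grabs one large managed allocation up front and hands out aligned slices with a simple bump pointer, so the driver only ever sees a single big allocation.

#include <stdio.h>
#include "cuda_runtime.h"

/* One big managed allocation, carved into 512-byte-aligned slices.
   This is a bump allocator: individual slices are never freed. */
static char  *poolBase = NULL;
static size_t poolSize = 0;
static size_t poolUsed = 0;

int poolInit(size_t bytes) {
	poolSize = bytes;
	poolUsed = 0;
	return cudaMallocManaged((void **)&poolBase, bytes, cudaMemAttachHost) == cudaSuccess;
}

void *poolAlloc(size_t bytes) {
	size_t rounded = (bytes + 511) & ~(size_t)511;	/* keep every slice 512-byte aligned */
	if (poolBase == NULL || poolUsed + rounded > poolSize)
		return NULL;				/* pool exhausted */
	void *p = poolBase + poolUsed;
	poolUsed += rounded;
	return p;
}

void poolDestroy(void) {
	cudaFree(poolBase);
	poolBase = NULL;
	poolSize = poolUsed = 0;
}

int main() {
	if (!poolInit(512u * 1024 * 1024))		/* a single 512MB managed block */
		return 1;

	void *a = poolAlloc(sizeof(unsigned char));	/* both slices come out of the */
	void *b = poolAlloc(sizeof(unsigned char));	/* same managed allocation      */
	printf("a = %p, b = %p, used = %zu bytes\n", a, b, poolUsed);

	poolDestroy();
	return 0;
}

With a scheme like this, each small request costs you at most one 512-byte slice of the pool rather than a whole driver-sized allocation.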

Christian

There is definitely an allocation granularity, and it may be larger than the size of the allocations you are requesting. This would inflate the amount of memory actually used.
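One way to observe that granularity directly, rather than by subtracting pointer values, is to watch how much the free memory reported by cudaMemGetInfo drops per tiny allocation. A rough sketch (note the drop can be zero for some requests, because the driver may serve several of them from one larger internal reservation):

#include <stdio.h>
#include "cuda_runtime.h"

int main() {
	size_t freeBefore, freeAfter, total;
	void *p[16];

	cudaFree(0);					/* force context creation up front */
	cudaMemGetInfo(&freeBefore, &total);

	for (int i = 0; i < 16; i++) {
		cudaMalloc(&p[i], 1);			/* request a single byte */
		cudaMemGetInfo(&freeAfter, &total);
		printf("alloc %2d: free memory dropped by %llu bytes\n",
		       i, (unsigned long long)(freeBefore - freeAfter));
		freeBefore = freeAfter;
	}

	for (int i = 0; i < 16; i++)
		cudaFree(p[i]);
	return 0;
}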

So should I write my own memory manager that allocates larger blocks and redistributes them?
For example, if I request 512MB of memory in small portions of 512 bytes, CUDA almost saturates my VRAM. This can be tested with this code:

/* includes, gpuMalloc() and gpuMallocShared() are identical to the first post; only main() changes */

int main() {	
	void *ptr;

	for(int i = 0; i < 1024*1024; i++)
		ptr = gpuMallocShared(sizeof(unsigned char) * 512);
	
	getchar();

	return 0;
}

If I allocate 512MB in one block, CUDA will allocate almost 700MB of memory.

Is it worth using a sub-allocator?

It would be very useful to know how cudaMallocManaged actually allocates memory, because this behavior doesn't make sense to me.
Thanks,
Francesco

If you allocate 512MB in one block, that will require about 512MB of your GPU memory. The difference between 512MB and 700MB is CUDA overhead. It should not increase (much) as you allocate more memory.

For example, if you do the first allocation of 512MB, you may witness that 700MB is used on the GPU. After that, if you allocate another 512MB, you should witness about 1200MB used on the GPU.
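If you want to check that on your own card, you can read cudaMemGetInfo after context creation and after each large allocation. A rough sketch (whether a managed allocation is charged against device memory immediately can depend on the platform and driver):

#include <stdio.h>
#include "cuda_runtime.h"

static void report(const char *label) {
	size_t freeMem, totalMem;
	cudaMemGetInfo(&freeMem, &totalMem);
	printf("%-22s %.1f MB in use on the device\n", label,
	       (double)(totalMem - freeMem) / (1024.0 * 1024.0));
}

int main() {
	void *a, *b;

	cudaFree(0);				/* create the context */
	report("after context:");		/* the fixed CUDA overhead */

	cudaMallocManaged(&a, 512u * 1024 * 1024, cudaMemAttachHost);
	report("after first 512MB:");		/* overhead + ~512MB */

	cudaMallocManaged(&b, 512u * 1024 * 1024, cudaMemAttachHost);
	report("after second 512MB:");		/* should grow by about another 512MB */

	cudaFree(a);
	cudaFree(b);
	return 0;
}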

Only you can decide if it is worth it to use a sub-allocator.

Your method of estimating used memory by comparing pointers is a weird one:

memory += (char *)p - (char *)previous;

You’re operating in a virtual memory space where addresses may or may not have physical memory pages (on either CPU or GPU or both) behind them.

If you plan to do memory pooling in CUDA, google for existing solutions first. There may be quite a few libraries available already - optimized for different use cases.

The method actually matches the VS profiler data, so I assume that CUDA allocates memory sequentially. It is fun to notice that there is no garbage-collection mechanism behind the allocator and/or the compiler (but maybe that is because the optimization level is set to debug).

I tried to search for optimized CUDA sub-allocators, but they are not suitable for my purposes. Wish me luck, then…

There's a related thread here, mostly with references to academic papers:

https://devtalk.nvidia.com/default/topic/878664/custom-memory-allocator-for-cuda-desired/