Question about GPU Memory Overhead with cudaMallocManaged

I’m rather new to CUDA and I was wondering if I could get some pointers on GPU memory allocation.
I have this very simple testing program here:

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

using namespace std;

struct heh
{
	unsigned char *a;
	unsigned int b;
	heh(unsigned int _b);
	~heh();
};

heh::heh(unsigned int _b)
{
	b = _b;
	cudaError_t err = cudaMallocManaged(&a, b);
	if (err != cudaSuccess)
		printf("Error: %s\n", cudaGetErrorString(err));
}

heh::~heh()
{
	cudaFree(a);
}

int main()
{
	vector<heh*> neato;

	unsigned int amount = 10000;
	unsigned int size = 1;

	for (unsigned int a = 0; a < amount; a++)
	{
		heh *mem = new heh(size);
		for (unsigned int b = 0; b < size; b++)
			mem->a[b] = rand() % 256;
		neato.push_back(mem);
	}

	unsigned long tot = 0;
	for (unsigned int q = 0; q < 10000; q++)
	{
		tot = q;
		for (unsigned int a = 0; a < amount; a++)
			for (unsigned int b = 0; b < size; b++)
				tot += neato[a]->a[b];
	}
	printf("Wow %lu\n", tot);

	for (unsigned int a = 0; a < amount; a++)
		delete neato[a];

	cudaDeviceReset();

	return 0;
}
And I run it via nvprof, two ways: first with size set to 10000 and amount to 1, then with the numbers reversed. It’s to be expected that there’s more overhead with 10k one-byte objects than with one ~10 KB object. What I don’t understand, however, is that when I run it with 10k one-byte objects, task manager / Visual Studio says my GPU is using ~800 MB of memory, while nvprof says the program used only about 40 MB:

C:\Users\Syerjchep\source\repos\MyCuda\x64\Debug>nvprof ./MyCuda.exe
==15468== NVPROF is profiling process 15468, command: ./MyCuda.exe
Wow 1282703
==15468== Profiling application: ./MyCuda.exe
==15468== Warning: Found 49 invalid records in the result.
==15468== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==15468== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   57.46%  765.90ms     10000  76.589us  16.950us  167.90ms  cudaMallocManaged
                   39.69%  529.00ms     10000  52.899us  28.347us  1.4556ms  cudaFree
                    2.79%  37.125ms         1  37.125ms  37.125ms  37.125ms  cudaDeviceReset
                    0.05%  654.03us        45  14.533us     292ns  318.25us  cuDeviceGetAttribute
                    0.01%  163.65us         1  163.65us  163.65us  163.65us  cuDeviceGetName
                    0.00%  8.7670us         1  8.7670us  8.7670us  8.7670us  cuDeviceTotalMem
                    0.00%  2.6300us         3     876ns     292ns  2.0460us  cuDeviceGetCount
                    0.00%  1.4610us         2     730ns     292ns  1.1690us  cuDeviceGet

==15468== Unified Memory profiling result:
Device "GeForce GTX 980 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   10000  4.0000KB  4.0000KB  4.0000KB  39.06250MB  11.00649ms  Device To Host

Press any key to continue . . .

Not only is the discrepancy between nvprof and my other diagnostics odd, but this means that each one of those objects is using between 4 KB and 80 KB of memory to store one byte of data. Is this amount of overhead normal?

(It should be noted that RAM usage is minimal and that if I set amount higher the program tends to just run out of GPU memory and crash.)

Regarding device memory usage: yes, it’s normal. Managed allocations have a minimum granularity of one page. The page size may vary, but the minimum is, I believe, 4 KB. 10k × 4 KB = 40 MB, which matches the nvprof figure.

Regarding task manager / Visual Studio: what you’re looking at there is host memory. There is overhead associated with initializing CUDA and running your code that contributes to that number. If you use a tool like nvidia-smi to look at device memory usage in a similarly all-inclusive fashion, you will see that your program is using more than just 40 MB of device memory as well.