Question about GPU memory overhead with cudaMallocManaged

I’m rather new to CUDA and I was wondering if I could get some pointers on GPU memory allocation.
I have this very simple testing program here:

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

using std::vector;

struct heh
{
	unsigned char *a;   // managed buffer
	unsigned int b;     // buffer size in bytes
	heh(unsigned int _b);
	~heh();
};

heh::heh(unsigned int _b)
{
	b = _b;
	cudaError_t err = cudaMallocManaged(&a, b);
	if (err != cudaSuccess)
		printf("Error: %s\n", cudaGetErrorString(err));
}

heh::~heh()
{
	cudaFree(a);
}

int main()
{
	srand(5);
	vector<heh*> neato;

	unsigned int amount = 10000;
	unsigned int size = 1;

	for (unsigned int a = 0; a < amount; a++)
	{
		heh *mem = new heh(size);
		for (unsigned int b = 0; b < size; b++)
			mem->a[b] = rand() % 256;
		neato.push_back(mem);
	}

	unsigned long tot = 0;
	for (unsigned int q = 0; q < 10000; q++)
	{
		tot = q;
		for (unsigned int a = 0; a < amount; a++)
			for (unsigned int b = 0; b < size; b++)
				tot += neato[a]->a[b];
	} 
	printf("Wow %lu\n", tot);

	for (unsigned int a = 0; a < amount; a++)
		delete neato[a];

	cudaDeviceReset();

	return 0;
}

And I run it via nvprof, in two ways: first with size set to 10000 and amount to 1, then with the numbers reversed. It’s to be expected that there’s more overhead with 10k objects of 1 byte each than with one ~10 KB object. What I don’t understand, however, is that when I run it with 10k 1-byte objects, Task Manager / Visual Studio says my GPU is using ~800 MB of memory. Additionally, nvprof says the program used about 40 MB of memory:

C:\Users\Syerjchep\source\repos\MyCuda\x64\Debug>nvprof ./MyCuda.exe
==15468== NVPROF is profiling process 15468, command: ./MyCuda.exe
Wow 1282703
==15468== Profiling application: ./MyCuda.exe
==15468== Warning: Found 49 invalid records in the result.
==15468== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==15468== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   57.46%  765.90ms     10000  76.589us  16.950us  167.90ms  cudaMallocManaged
                   39.69%  529.00ms     10000  52.899us  28.347us  1.4556ms  cudaFree
                    2.79%  37.125ms         1  37.125ms  37.125ms  37.125ms  cudaDeviceReset
                    0.05%  654.03us        45  14.533us     292ns  318.25us  cuDeviceGetAttribute
                    0.01%  163.65us         1  163.65us  163.65us  163.65us  cuDeviceGetName
                    0.00%  8.7670us         1  8.7670us  8.7670us  8.7670us  cuDeviceTotalMem
                    0.00%  2.6300us         3     876ns     292ns  2.0460us  cuDeviceGetCount
                    0.00%  1.4610us         2     730ns     292ns  1.1690us  cuDeviceGet

==15468== Unified Memory profiling result:
Device "GeForce GTX 980 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   10000  4.0000KB  4.0000KB  4.0000KB  39.06250MB  11.00649ms  Device To Host

C:\Users\Syerjchep\source\repos\MyCuda\x64\Debug>PAUSE
Press any key to continue . . .

Not only is the discrepancy between nvprof and my other diagnostics odd, but this means that each one of those objects is using between 4 KB and 80 KB of memory to store one byte of data. Is this amount of overhead normal?

(It should be noted that RAM usage is minimal and that if I set amount higher the program tends to just run out of GPU memory and crash.)

Regarding device memory usage: yes, it’s normal. There is a minimum allocation size for managed data, equal to one page. The page size may vary, but the minimum is, I believe, 4 KB. 10k × 4 KB = 40 MB.

Regarding task manager/visual studio, what you’re referring to now is host memory. There are CUDA overheads associated with starting up CUDA and running your code that may contribute to the number. If you use a tool like nvidia-smi to look at device memory usage in a similar fashion (i.e. all of it) then you will also see that your program is using more than just 40Mbyte of device memory.

please don’t post pictures of text on this forum

My bad, I deleted that reply and rewrote it here. Thanks for the notice.

I ran the same code in my RTX 4080 with cuda 12.6 and the results are as follows:

CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)    Operation
 ----------  -----  --------  --------  --------  --------  -----------  -------------
      0.010  10001     0.000     0.000     0.000     0.000        0.000  [CUDA memset]

Does this mean that the current allocation granularity is smaller? Or am I doing something wrong?
Thanks a lot!

Your results don’t say anything about the allocation granularity. There are 10,000 operations in the main loop of the code, and 10,000 × 1 byte = 0.010 MB. Evidently those operations are being handled via some sort of memset operation, and a memset can operate on a single byte.

We further expect different behavior because the original posting from six years ago was on a pre-Pascal GPU, which has a different UM implementation than an RTX GPU on Linux.

When I profile the code on linux on my L4 GPU, I don’t get any cuda_gpu_mem_size_sum section:

# nsys  profile --stats=true ./t260
Wow 1280732
Generating '/tmp/nsys-report-4bcc.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /root/bobc/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)        Name 
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  --------------
     90.5    3,686,019,285        208  17,721,246.6  10,063,794.5     1,516  100,140,412  26,054,558.5  poll      
      9.3      378,729,103     50,671       7,474.3       1,162.0     1,027   79,495,597     466,309.0  ioctl     
      0.1        4,119,258         53      77,721.8      13,037.0    10,179    1,299,931     238,377.3  mmap64    
      0.0        1,624,656         25      64,986.2      13,453.0     2,556      356,008     107,754.1  mmap      
      0.0          787,729         18      43,762.7      43,466.5    24,655       77,695      14,062.7  sem_timedwait
      0.0          786,727         44      17,880.2      17,166.5     5,760       28,417       4,967.6  open64    
      0.0          444,757         13      34,212.1       6,564.0     3,463      326,011      88,281.4  munmap    
      0.0          280,513          2     140,256.5     140,256.5   117,584      162,929      32,063.8  pthread_create
      0.0          210,329          1     210,329.0     210,329.0   210,329      210,329           0.0  pthread_join
      0.0          195,452         29       6,739.7       5,580.0     2,014       19,042       3,982.2  fopen     
      0.0           84,174         49       1,717.8          65.0        60       80,877      11,544.0  fgets     
      0.0           66,451         23       2,889.2       2,989.0     1,543        4,105         686.0  fclose    
      0.0           55,133         51       1,081.0       1,055.0       720        2,135         223.9  fcntl     
      0.0           54,431         20       2,721.6       2,839.0     1,432        5,320       1,001.9  write     
      0.0           53,043         23       2,306.2       2,070.0     1,267        4,860         942.4  read      
      0.0           39,833          6       6,638.8       6,899.5     2,648       10,661       2,795.9  open      
      0.0           19,669          2       9,834.5       9,834.5     5,178       14,491       6,585.3  socket    
      0.0           15,590          1      15,590.0      15,590.0    15,590       15,590           0.0  connect   
      0.0           13,643          1      13,643.0      13,643.0    13,643       13,643           0.0  fread     
      0.0            9,694          1       9,694.0       9,694.0     9,694        9,694           0.0  pipe2     
      0.0            6,346          7         906.6         888.0       805        1,015          72.3  dup       
      0.0            2,613          1       2,613.0       2,613.0     2,613        2,613           0.0  bind      
      0.0            1,662          1       1,662.0       1,662.0     1,662        1,662           0.0  listen    
      0.0              649         10          64.9          53.0        48          178          39.9  fflush    

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
     50.1      282,275,737     10,000       28,227.6        6,823.0        4,359  206,274,397  2,062,668.8  cudaMallocManaged
     26.6      149,803,327     10,000       14,980.3       14,270.0       11,383      553,507      9,720.1  cudaFree
     23.3      131,318,138          1  131,318,138.0  131,318,138.0  131,318,138  131,318,138          0.0  cudaDeviceReset
      0.0           17,597          1       17,597.0       17,597.0       17,597       17,597          0.0  cuCtxSynchronize
      0.0            1,453          1        1,453.0        1,453.0        1,453        1,453          0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /root/bobc/report2.sqlite does not contain CUDA kernel data.
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /root/bobc/report2.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /root/bobc/report2.sqlite does not contain GPU memory data.
Generated:
    /root/bobc/report2.nsys-rep
    /root/bobc/report2.sqlite
#

Are you on windows?

Yes, I’m on Windows.

I’m actually trying to find the answer to this:
when allocating multiple pieces of data smaller than 4 KB, does each one get its own 4 KB block, or can two 1-byte allocations share the same block?

Windows doesn’t use demand-paging of data.

Separate 1-byte allocations will not be grouped together. Beyond that, I’m not sure the windows UM behavior is specified anywhere to this degree.

On the other hand, if you decide to use cudaMalloc, the answers are discoverable with ordinary testing. With cudaMallocManaged on Windows, you might be able to find the granularity or overhead by doing successive one-byte allocations until you hit an out-of-memory error, then dividing the memory used by the number of allocations. That is effectively what you can do with cudaMalloc as well.
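A sketch of that measurement idea, using cudaMemGetInfo to watch free device memory shrink instead of allocating all the way to an out-of-memory error. This is untested on Windows; cudaMemGetInfo reports driver-level free/total memory, so the result folds in any driver bookkeeping overhead as well as page rounding, and on some platforms managed memory may not be device-resident until first touch, so interpret the numbers with care:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    const int count = 10000;                 // number of 1-byte probe allocations
    static unsigned char *ptrs[10000];
    for (int i = 0; i < count; i++)
    {
        if (cudaMallocManaged(&ptrs[i], 1) != cudaSuccess)
        {
            printf("out of memory after %d allocations\n", i);
            return 1;
        }
    }

    cudaMemGetInfo(&freeAfter, &total);
    printf("~%zu bytes of device memory per 1-byte allocation\n",
           (freeBefore - freeAfter) / count);

    for (int i = 0; i < count; i++)
        cudaFree(ptrs[i]);
    return 0;
}
```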

Got it!
Thank you so much for the clear and detailed reply :D