I'm trying to monitor GPU memory usage during data copy and initialization, before kernel execution:
[codebox]
typedef struct node_ {
int sum_x;
int sum_y;
int sum_x_2;
int sum_y_2;
int sum_x_y;
int n;
} node;
…
node *_node = (node *) malloc(sizeof(node));
_node->sum_x = 0;
_node->sum_y = 0;
_node->sum_x_2 = 0;
_node->sum_y_2 = 0;
_node->sum_x_y = 0;
_node->n = 0;
…
CUdevice device;
CUcontext context;
cuInit(0);
cuDeviceGet(&device, 0);
cuCtxCreate(&context, 0, device);
cuCtxPopCurrent(&context);
unsigned int free, total;
cudaError_t cudaError;
for (int index = 0; index < hash_table_size; index++) {
cuCtxPushCurrent(context);
cuMemGetInfo(&free, &total);
cuCtxPopCurrent(&context);
printf("GPU Memory status: %10u %10u\n", free, total);
node *__node;
cudaError = cudaMalloc((void **) &__node, sizeof(node));
cudaError = cudaMemcpy(__node, _node, sizeof(node), cudaMemcpyHostToDevice);
host_hash_table[index] = __node;
}
cuCtxDetach(context);
[/codebox]
Output:
[codebox]
GPU Memory status: 499240960 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
…
[/codebox]
Questions:
- Why does the first iteration of the loop allocate 499240960 - 473931520 = 25309440 ≈ 24 MB, when sizeof(node) is only 24 bytes?
- Why does free memory not decrease on subsequent iterations, even though cudaMalloc runs on every iteration and should allocate 24 bytes each time?
- If I set hash_table_size to ~100,000 - 500,000, the loop eventually grabs ~200 - 300 MB at once, and every following iteration drops free memory by another ~1 MB, so it obviously overflows the 512 MB card. But even hash_table_size = 1,000,000 (1,000,000 structs × 24 bytes = 24,000,000 bytes) is only ~24 MB. Why?
P.S.: I know about data alignment and memory coalescing on the GPU, and I know that an array of structs is usually better split into separate arrays, but I'm running this sample as an experiment while learning CUDA.