GPU Memory monitoring

I’m trying to monitor GPU memory during data copy and initialization, before kernel execution:

[codebox]
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node_ {
	int sum_x;
	int sum_y;
	int sum_x_2;
	int sum_y_2;
	int sum_x_y;
	int n;
} node;

/* declared elsewhere in the full program; 11 matches the 11 status lines below */
int hash_table_size = 11;
node **host_hash_table = (node **) malloc(hash_table_size * sizeof(node *));

/* zero-initialized host-side template */
node *_node = (node *) malloc(sizeof(node));
_node->sum_x = 0;
_node->sum_y = 0;
_node->sum_x_2 = 0;
_node->sum_y_2 = 0;
_node->sum_x_y = 0;
_node->n = 0;

CUdevice device;
CUcontext context;

cuInit(0);
cuDeviceGet(&device, 0);
cuCtxCreate(&context, 0, device);
cuCtxPopCurrent(&context);

size_t free_mem, total_mem;
cudaError_t cudaError;

for (int index = 0; index < hash_table_size; index++) {
	/* make the driver-API context current only for the memory query,
	   then pop it again so the runtime API can do the allocation */
	cuCtxPushCurrent(context);
	cuMemGetInfo(&free_mem, &total_mem);
	cuCtxPopCurrent(&context);
	printf("GPU Memory status: %10zu %10zu\n", free_mem, total_mem);

	node *__node;
	cudaError = cudaMalloc((void **) &__node, sizeof(node));
	cudaError = cudaMemcpy(__node, _node, sizeof(node), cudaMemcpyHostToDevice);
	host_hash_table[index] = __node;
}

cuCtxDetach(context);
[/codebox]

Output:

[codebox]
GPU Memory status:  499240960  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
GPU Memory status:  473931520  536543232
[/codebox]

Questions:

  1. Why does the first iteration of the loop allocate 499240960 - 473931520 = 25309440 bytes (~24 MB), when sizeof(node) is only 24 bytes?

  2. Why is no memory allocated on the following iterations, even though cudaMalloc runs on every iteration and should allocate 24 bytes each time?

  3. If I set hash_table_size to ~100,000 - 500,000, the loop at first rapidly allocates ~200 - 300 MB, then every subsequent step decreases free memory by ~1 MB, so it obviously overflows the 512 MB card. But even 1,000,000 entries of the 24-byte struct is only ~24 MB (24,000,000 bytes), so why?

P.S.: I know about data alignment and memory coalescing on the GPU, and I know that an array of structs is more efficient when split into separate arrays, but I’m running this sample as an experiment while learning CUDA.

I think if you check the free memory before the first allocation rather than after it, you will find that the initial 24 MB is already assigned - it is effectively overhead associated with the CUDA context and some pre-allocated memory.
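Here is a minimal runtime-API-only sketch that makes that one-time overhead visible (the exact numbers will vary with driver version and display load; `cudaFree(0)` is just the conventional trick to force lazy context creation before measuring):

[codebox]
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
	size_t free_mem, total_mem;
	void *p;

	cudaFree(0);  /* force lazy context creation before measuring */
	cudaMemGetInfo(&free_mem, &total_mem);
	printf("in use after context creation, before any allocation: %zu bytes\n",
	       total_mem - free_mem);

	cudaMalloc(&p, 24);  /* a 24-byte request, like sizeof(node) */
	cudaMemGetInfo(&free_mem, &total_mem);
	printf("in use after the first 24-byte cudaMalloc: %zu bytes\n",
	       total_mem - free_mem);

	cudaFree(p);
	return 0;
}
[/codebox]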

When I looked at this a while ago, I found that CUDA preallocates a 16 MB block from which user allocations are made.

After the initial 16 MB allocation, there are a small number of 4 KB pages, and then the rest of memory is mapped in 64 KB pages. If you allocate 24 bytes when the device is working with 64 KB pages, a full 64 KB is consumed. You can read more about this in this post.
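If you want to see the effective granularity on your own card, a rough sketch like this measures the average cost of a tiny allocation (the result also folds in the preallocated pool, so treat it as an estimate rather than an exact page size):

[codebox]
#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

int main(void) {
	void *ptrs[N];
	size_t free_before, free_after, total;

	cudaFree(0);  /* exclude context-creation overhead from the measurement */
	cudaMemGetInfo(&free_before, &total);

	for (int i = 0; i < N; i++)
		cudaMalloc(&ptrs[i], 24);  /* 24-byte requests, like the node struct */

	cudaMemGetInfo(&free_after, &total);
	printf("average bytes consumed per 24-byte allocation: %zu\n",
	       (free_before - free_after) / N);

	for (int i = 0; i < N; i++)
		cudaFree(ptrs[i]);
	return 0;
}
[/codebox]

That is also the answer to question 3: with one cudaMalloc per entry you pay the page granularity 100,000+ times over. Allocating the whole hash table as a single block (one cudaMalloc of hash_table_size * sizeof(node)) and indexing into it avoids that per-allocation overhead entirely.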