I'm trying to monitor GPU memory usage during data copy and initialization, before kernel execution:
[codebox]
typedef struct node_ {
int sum_x;
int sum_y;
int sum_x_2;
int sum_y_2;
int sum_x_y;
int n;
} node;
…
node *_node = (node *) malloc(sizeof(node));
_node->sum_x = 0;
_node->sum_y = 0;
_node->sum_x_2 = 0;
_node->sum_y_2 = 0;
_node->sum_x_y = 0;
_node->n = 0;
…
CUdevice device;
CUcontext context;
cuInit(0);
cuDeviceGet(&device, 0);
cuCtxCreate(&context, 0, device);
cuCtxPopCurrent(&context);
unsigned int free, total;
cudaError_t cudaError;
for (int index = 0; index < hash_table_size; index++) {
cuCtxPushCurrent(context);
cuMemGetInfo(&free, &total);
cuCtxPopCurrent(&context);
printf("GPU Memory status: %10u %10u\n", free, total);
node *__node;
cudaError = cudaMalloc((void **) &__node, sizeof(node));
cudaError = cudaMemcpy(__node, _node, sizeof(node), cudaMemcpyHostToDevice);
host_hash_table[index] = __node;
}
cuCtxDetach(context);
[/codebox]
Output:
[codebox]
GPU Memory status: 499240960 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
GPU Memory status: 473931520 536543232
…
[/codebox]
Questions:
- Why does the first iteration of the loop allocate 499240960 - 473931520 = 25309440 ≈ 24 MB, when sizeof(node) is only 24 bytes?
- Why does free memory not decrease on subsequent iterations, even though cudaMalloc runs on every iteration and should allocate 24 bytes each time?
- If I set hash_table_size to ~100,000 - 500,000, the loop eventually grabs ~200 - 300 MB at once, and every following iteration drops free memory by another ~1 MB, so it obviously overflows the 512 MB card. But even hash_table_size = 1,000,000 (1,000,000 structs × 24 bytes = 24,000,000 bytes) is only ~24 MB. Why?
P.S.: I know about data alignment and memory coalescing on the GPU, and I know that an array of structs is usually better split into separate arrays, but I'm running this sample as an experiment while learning CUDA.