Host memory consumption spike with cudaMalloc

Hello,

I have a GeForce 960M chip in an Ubuntu 14.04 laptop.

The recursive function below allocates device memory. However, I see host memory spiking from about 1.5 GB to 8 GB (confirmed by running free -g). nvidia-smi shows device memory consumption increasing in parallel with host memory consumption until it hits 1.9 GB.

The host memory spike disappears if the cudaMalloc lines are commented out, which rules out the recursion itself as the cause.

------------------------------- CODE ----------------------------------------------

Node * copyTreeToGPU() {

    if (hasChildren) {
        // Copy children to GPU first (depth-first recursion)
        for (int i = 0; i < 8; i++) {
            d_ptr[i] = ptr[i]->copyTreeToGPU();
        }
    }

    // Copy self: allocate one Node on the device.
    // The memcpy stays commented out; the spike occurs with cudaMalloc alone.
    Node *d_ptrtmp;
    CHECK(cudaMalloc((void **) &d_ptrtmp, sizeof(Node)));
    //CHECK(cudaMemcpy(d_ptrtmp, this, sizeof(Node), cudaMemcpyHostToDevice));
    copiedToGPU = true;
    return d_ptrtmp;

} // copy from host to GPU

I don't know the implementation of cudaMalloc(), but it would make sense to me for it to keep all meta-information about the allocations on the host (apart from the device-side page tables), and to set aside large chunks of memory from which individual allocations are served. And if sizeof(Node) is only a few bytes, it is entirely plausible that the meta-information takes up more space than the allocation itself.
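
As a rough illustration (not the actual implementation, which is undocumented), a loop like the following makes the per-allocation host overhead easy to observe: run it and watch host RSS with free -m or top while it waits. The allocation count and the 16-byte size are arbitrary stand-ins for a small Node.

------------------------------- CODE ----------------------------------------------

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;           // number of tiny allocations (arbitrary)
    void **ptrs = new void*[N];

    // Many tiny device allocations; each one also costs host-side bookkeeping.
    int allocated = 0;
    for (; allocated < N; allocated++) {
        if (cudaMalloc(&ptrs[allocated], 16) != cudaSuccess) break;
    }

    printf("%d allocations done - inspect host RSS now, then press Enter\n",
           allocated);
    getchar();

    for (int i = 0; i < allocated; i++) cudaFree(ptrs[i]);
    delete[] ptrs;
    return 0;
} // observe per-allocation overhead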

To resolve your problem, allocate memory in larger chunks than individual nodes: allocate large arrays of nodes at a time and take nodes from there.
Ideally, get rid of the tree-of-pointers structure altogether and use a flat array instead, which usually gives much better performance.
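
A minimal sketch of that idea, assuming a hypothetical FlatNode whose children are stored as indices into one flat array rather than device pointers (CHECK here is a stand-in for the error-checking macro in your original code). The whole tree then costs exactly one cudaMalloc and one cudaMemcpy, however many nodes it has.

------------------------------- CODE ----------------------------------------------

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for the poster's CHECK macro: abort on any CUDA error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("CUDA error: %s\n", cudaGetErrorString(err));     \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// Hypothetical flattened node: children are indices into one array,
// so no per-node device allocation is needed.
struct FlatNode {
    float payload;      // placeholder for the real Node data
    int   children[8];  // index of each child in the flat array; -1 = none
};

// One allocation and one transfer for the whole tree.
FlatNode *copyTreeToGPU(const std::vector<FlatNode> &tree) {
    FlatNode *d_tree = nullptr;
    CHECK(cudaMalloc((void **) &d_tree, tree.size() * sizeof(FlatNode)));
    CHECK(cudaMemcpy(d_tree, tree.data(),
                     tree.size() * sizeof(FlatNode),
                     cudaMemcpyHostToDevice));
    return d_tree;
} // single-chunk copy from host to GPU

The flat array can be built with a single depth-first pass over the host tree, recording each node's children as the indices at which they were appended.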

Thank you for taking the time to respond.

Transferring the tree as a single-chunk array reduced host memory consumption significantly.

I'm still not sure why there should be any host memory increase with cudaMalloc, though.