I have been converting a CUDA project to use managed memory instead of the classic explicit host/device allocations.
While testing and debugging the managed-memory version of the project, I have discovered memory leaks that I cannot trace back to their source. I thought the CUDA driver cleaned up all allocations once a CUDA application stops, but this does not seem to be the case with managed memory.
In particular, I discovered that I was using memset and memcpy on managed memory, and this was generating memory leaks. I replaced these calls with the runtime API functions cudaMemset and cudaMemcpy, and those leaks disappeared.
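The change was along these lines (an illustrative sketch using the x, y and N names from the reproducer below; the real project uses different buffers):

// Before: host C library calls operating directly on managed allocations
memset(y, 0, N * sizeof(int));
memcpy(y, x, N * sizeof(int));

// After: CUDA runtime API equivalents
cudaMemset(y, 0, N * sizeof(int));
cudaMemcpy(y, x, N * sizeof(int), cudaMemcpyDefault);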
Nevertheless, I then realized that there were other memory leaks associated with a call to a function that sets a bit in an int array allocated with managed memory.
I managed to reproduce the issue with the very simple program below. When I debug it and close the debug session right after the call to set_bit_host, nvidia-smi reports a memory leak: roughly 8 MB of GPU memory is never freed. I wondered whether this could be a driver issue, so I have updated to the latest driver for my GPU card.
I am using a Tesla K20c on a Windows 7 64-bit system. My driver version is 385.08, and the code has been compiled with CUDA Toolkit 8.0 GA2.
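HANDLE_ERROR is the usual error-checking wrapper around CUDA runtime calls; something along these lines (a minimal sketch, together with the headers used by the reproducer, so it is self-contained) is what I have in mind:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking helper (sketch of the usual pattern; the exact
// macro in my project may differ slightly): print the error and abort.
static void handleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR(err) (handleError((err), __FILE__, __LINE__))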
// CUDA kernel to add elements of two arrays
__global__
void add(int n, int *x, int *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
__host__ __device__
void set_bit_host(int& n_num, const int n_bit)
{
    n_num |= 1 << n_bit;
}
int main(void)
{
    int N = 1 << 20;
    int *x, *y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    HANDLE_ERROR(cudaMallocManaged(&x, N * sizeof(int)));
    HANDLE_ERROR(cudaMallocManaged(&y, N * sizeof(int)));

    // Allocate standard device linear memory
    //HANDLE_ERROR(cudaMalloc(&x, N * sizeof(float)));
    //HANDLE_ERROR(cudaMalloc(&y, N * sizeof(float)));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1;
        y[i] = 2;
        set_bit_host(y[i], 0);
    }

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}