Memory leak using managed memory

I have been converting a CUDA project to use managed memory instead of the classical host-device allocations.
While testing and debugging the managed memory version of the project, I discovered memory leaks whose origin I cannot identify. I thought the CUDA driver cleaned up all allocations once a CUDA application stops, but this does not seem to be the case with managed memory.

In particular, I discovered that I was using memset and memcpy on managed memory, and this was generating memory leaks. I changed these to the runtime API calls cudaMemset and cudaMemcpy, and the memory leaks disappeared.
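For reference, the replacement described above might look roughly like the sketch below. The helper name clear_and_copy_managed is hypothetical and not taken from the original project, and the use of cudaMemcpyDefault assumes a platform with unified virtual addressing, which managed memory already requires.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not from the original project): zero a managed
// buffer and copy it into a second managed buffer using runtime API calls
// instead of memset/memcpy. Returns the first error encountered.
cudaError_t clear_and_copy_managed(int *dst, int *src, size_t count)
{
    const size_t bytes = count * sizeof(int);

    // cudaMemset accepts a managed pointer just like a device pointer.
    cudaError_t err = cudaMemset(src, 0, bytes);
    if (err != cudaSuccess) return err;

    // cudaMemcpyDefault lets the runtime infer the copy direction from
    // the pointer values (assumes unified virtual addressing).
    err = cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
    if (err != cudaSuccess) return err;

    // cudaMemset may be asynchronous with respect to the host, so
    // synchronize before the CPU touches the managed buffers again.
    return cudaDeviceSynchronize();
}
```

Both pointers are expected to come from cudaMallocManaged. Checking the returned cudaError_t at the call site keeps failures visible instead of silently continuing.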

However, I then realized that there were other memory leaks associated with a call to a function that sets a bit in an int array allocated using managed memory.

I managed to reproduce the issue using a very simple program, see below. When debugging and closing the debug session just after calling set_bit_host I get a memory leak which is reported in nvidia-smi (there is basically an amount of memory that is not freed, around 8 MB). I was wondering if this could be a driver issue, and I have updated to the latest driver for my GPU card.

I am using a Tesla K20c on a Windows 7 64-bit system. My driver version is 385.08. The code has been compiled with CUDA Toolkit 8.0 GA2.

// HANDLE_ERROR is an error-checking macro whose definition is not shown here.

// CUDA kernel to add elements of two arrays
__global__
void add(int n, int *x, int *y)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	for (int i = index; i < n; i += stride)
		y[i] = x[i] + y[i];
}

// Set bit n_bit of n_num; callable from both host and device code
__host__ __device__
void set_bit_host(int& n_num, const int n_bit)
{
	n_num |= 1 << n_bit;
}

int main(void)
{
	int N = 1 << 20;
	int *x, *y;

	// Allocate Unified Memory -- accessible from CPU or GPU
	HANDLE_ERROR(cudaMallocManaged(&x, N * sizeof(int)));
	HANDLE_ERROR(cudaMallocManaged(&y, N * sizeof(int)));

	// Allocate standard device linear memory
	//HANDLE_ERROR(cudaMalloc(&x, N * sizeof(float)));
	//HANDLE_ERROR(cudaMalloc(&y, N * sizeof(float)));

	// initialize x and y arrays on the host
	for (int i = 0; i < N; i++) {
		x[i] = 1;
		y[i] = 2;
		set_bit_host(y[i], 0);
	}

	// Launch kernel on 1M elements on the GPU
	int blockSize = 256;
	int numBlocks = (N + blockSize - 1) / blockSize;
	add<<<numBlocks, blockSize>>>(N, x, y);

	// Wait for GPU to finish before accessing on host
	HANDLE_ERROR(cudaDeviceSynchronize());

	// Free memory
	HANDLE_ERROR(cudaFree(x));
	HANDLE_ERROR(cudaFree(y));

	return 0;
}


Thanks txbob for your message. I do not know why, but the message appears to have disappeared from the forum.

If you look at the code that I have posted, you will see that the pointers x and y are not allocated twice; the code with cudaMalloc is commented out. This is just a simple program that I have been using to recreate the problem that I was experiencing in a larger project. As you can see, I have used the sample code from the source that you recommended.

I fully understand that I can only allocate once on a pointer, either with cudaMalloc or cudaMallocManaged.

The issue I am having is that exiting the program that uses managed memory before freeing the allocations generates a memory leak that is reported in nvidia-smi. With non-managed memory, there is no memory leak even if I stop the program before freeing the allocations.
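One possible way to make sure managed allocations are freed on every normal exit path is a small RAII wrapper, sketched below. The class name ManagedIntBuffer is hypothetical and not part of the original project, and the sketch assumes a C++11-capable compiler. It cannot help when the process is killed from outside (for example, by closing a debug session), because no destructor runs in that case.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical RAII wrapper (not part of the original project): owns one
// managed allocation and releases it with cudaFree when it leaves scope.
class ManagedIntBuffer {
public:
    explicit ManagedIntBuffer(size_t count)
        : ptr_(nullptr), status_(cudaSuccess)
    {
        status_ = cudaMallocManaged(&ptr_, count * sizeof(int));
        if (status_ != cudaSuccess) ptr_ = nullptr;
    }

    ~ManagedIntBuffer()
    {
        if (ptr_ != nullptr) {
            // Make sure no kernel is still using the buffer before freeing.
            cudaDeviceSynchronize();
            cudaFree(ptr_);
        }
    }

    // Non-copyable, so the allocation is never freed twice.
    ManagedIntBuffer(const ManagedIntBuffer&) = delete;
    ManagedIntBuffer& operator=(const ManagedIntBuffer&) = delete;

    int *get() const { return ptr_; }              // raw managed pointer
    cudaError_t status() const { return status_; } // result of the allocation

private:
    int *ptr_;
    cudaError_t status_;
};
```

In the reproduction program, x and y could each be held by such an object declared inside main, with get() passed to the kernel launch. An early return from main would then still release both buffers.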

I removed my message because BulatZiganshin pointed out that it didn’t make sense (as did you): those lines were commented out, I had missed that.

You may want to report this as a bug at