FAO NVIDIA Engineers: Memory leak in cudaMemcpyAsync - only occurs on host-to-device memory transfers

Hi there,

I have noticed a memory leak on my test platform whereby free system memory (as reported by top) steadily decreases while my CUDA application is active. However, the memory lost is NOT attributed to the process in question.

To cut a long investigative story short, the points below summarise the conditions under which the leak exhibits itself and those under which it does not. A full source code example is provided at the bottom for anyone wishing to test/replicate this issue in their own environment; however, I must stress that the code was transposed from a standalone server and may therefore contain typos or compilation errors - my apologies if it does.

Note: All tests were performed on a CentOS 5.4 server with a GTX 285 GPU (the leak has also been observed on an equally kickstarted Red Hat 5.4 installation).

Memory leak occurs when all of the following conditions are met:

  1. cudaMemcpyAsync is used (streamed) - see the sketch just after these lists

  2. the direction of the copy is cudaMemcpyHostToDevice

  3. NVIDIA driver version 195.36.15 or 256.40 is used with CUDA 3.0 or 3.1 (see the note in the section below regarding the driver versions tested)

Memory leak does not occur if any of the following is true:

  1. cudaMemcpy is used (synchronous / un-streamed)

  2. the direction of the copy is cudaMemcpyDeviceToHost

  3. the NVIDIA driver version is 190.18 with CUDA 2.3 (note: 190.18, 195.36.15 and 256.40 were the only drivers I had available to test with - this is not a complete list)
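For quick reference, here is the offending pattern condensed to a few lines. The buffer and stream names are just placeholders for the allocations made in the full listing at the bottom of this post:

//condensed sketch only - h_pinned comes from cudaMallocHost, d_buf from cudaMalloc,
//and stream from cudaStreamCreate, exactly as in the full listing below
for (int i = 0; i < N; i++)
{
  //this call leaks roughly 120 bytes of system memory per call in my tests (see the results further down):
  cudaMemcpyAsync(d_buf, h_pinned, DATA_SIZE, cudaMemcpyHostToDevice, stream);

  //neither of the following exhibits the leak:
  //  cudaMemcpy(d_buf, h_pinned, DATA_SIZE, cudaMemcpyHostToDevice);
  //  cudaMemcpyAsync(h_pinned, d_buf, DATA_SIZE, cudaMemcpyDeviceToHost, stream);
}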

The test setup can be seen in the sample code, but is summarised below for quick reference:

  1. a block of page-locked host memory is allocated, with one region per stream (note: this memory is left uninitialised, as its contents are irrelevant for the purposes of this test)

  2. the two streams are created

  3. device memory is allocated, with one region per stream

  4. the following three operations are performed inside a loop of N passes (a condensed view of this loop is sketched just after this list):

   4a) copy the input memory for stream[i] to the device memory of stream[i] using cudaMemcpyAsync (repeat for each stream)

   4b) run a kernel on each stream - the kernel is an empty function which does nothing (it doesn't need to be there, but it shows how one could double-buffer input to increase throughput)

   4c) copy the same memory back from the device to the host (again, repeat for each stream)
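A condensed view of that loop, structurally identical to the full listing at the bottom of the post (same variable names); it also shows the double-buffering shape mentioned in 4b - while one stream's kernel is running, the other stream's copy can already be in flight:

for (u_int32_t i = 0; i < loopCount; i++)
{
  for (u_int32_t j = 0; j < NUM_STREAMS; j++)   // 4a
    cudaMemcpyAsync(d_data + j*DATA_SIZE, h_data + j*DATA_SIZE, DATA_SIZE, cudaMemcpyHostToDevice, streams[j]);

  for (u_int32_t j = 0; j < NUM_STREAMS; j++)   // 4b
    myKernel<<<100, 512, 0, streams[j]>>>();

  for (u_int32_t j = 0; j < NUM_STREAMS; j++)   // 4c
    cudaMemcpyAsync(h_data + j*DATA_SIZE, d_data + j*DATA_SIZE, DATA_SIZE, cudaMemcpyDeviceToHost, streams[j]);
}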

Variables with NO impact:

Data Size: The size of the memory block being copied has no impact on BytesLeakedPerPass (this value can be divided by the number of streams to give BytesLeakedPerAsyncCall; a small helper for deriving these figures in-process is sketched after this section).

Stream Creation: Calling cudaStreamCreate and cudaStreamDestroy inside the loop doesn’t affect the leak.
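The BytesLeakedPerPass figures in the results below were worked out by hand from the two free -b readings. For anyone automating this, a helper along these lines would do the same job in-process (this is not part of the original test program; sysinfo() is simply one way to read free memory on Linux):

#include <sys/sysinfo.h>

//returns free system memory in bytes (Linux only)
static long long freeSystemBytes(void)
{
  struct sysinfo si;
  sysinfo(&si);
  return (long long)si.freeram * si.mem_unit;
}

//usage sketch around the test loop:
//  long long before = freeSystemBytes();
//  ...run the loop of loopCount passes...
//  long long after = freeSystemBytes();
//  double bytesLeakedPerPass      = (double)(before - after) / loopCount;
//  double bytesLeakedPerAsyncCall = bytesLeakedPerPass / NUM_STREAMS;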

The results of the tests I conducted are given below:

Test | Memory size of data | Number of passes | BytesLeakedPerPass
  1  |        512KB        |      65536       |        244
  2  |        512KB        |      65536       |        235
  3  |        512KB        |      65536       |        241
  4  |        128KB        |      65536       |        222
  5  |        128KB        |      65536       |        243
  6  |        128KB        |      65536       |        241
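To put those figures in context: at roughly 240 bytes per pass with two streams, that is about 120 bytes per cudaMemcpyAsync call, and a single 65536-pass run loses roughly 240 × 65536 ≈ 15.7 MB of system memory.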

It is perhaps obvious, but worth stating: if you wrap the entire program in a bash loop (while true; do ./cuda_test 1; done), the leak does not occur - since initialisation, startup and teardown happen on every iteration, this is to be expected.

That pretty much sums up what I've managed to uncover about this issue.

I hope this information is of use and allows you to identify and resolve the issue. Thankfully, in my environment I do not need any CUDA 3.x features and can safely stay on 190.18 with CUDA 2.3; however, I'm sure there are many others who aren't so fortunate - particularly if this is deployed in (pseudo) mission-critical environments.

Thanks for your time - the source code is below (again, apologies if it contains typos).

//Full code listing of the sample program which exhibits the bug.
//Note: Code taken from the simpleTexture example, so some includes may not be required.
//Note: Code transposed from a standalone server, typos may be present.
//Copy this file into the ${CUDA_SDK_ROOT}/C/src/cuda_mem_test directory to compile.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil_inline.h>

//prototypes
void runTestAsync(int argc, char** argv);
void runTestSync(int argc, char** argv);

//kernel
__global__ void myKernel()
{
}

int main(int argc, char** argv)
{
  runTestAsync(argc, argv);
  cutilExit(argc, argv);
}

void runTestAsync(int argc, char** argv)
{
  cudaSetDevice(cutGetMaxGflopsDeviceId());

  int loopCount = 10000;
  if (argc > 1)
  {
    loopCount = atoi(argv[1]);
  }

  const u_int32_t NUM_STREAMS = 2;
  const u_int32_t DATA_SIZE = 512*1024; //512KB per stream, although the size doesn't actually matter for this test

  //allocate page-locked host memory (one region per stream)
  u_int8_t* h_data = NULL;
  cutilSafeCall(cudaMallocHost((void**)&h_data, NUM_STREAMS*DATA_SIZE));

  //create the streams
  cudaStream_t streams[NUM_STREAMS];
  for (u_int32_t i = 0; i < NUM_STREAMS; i++)
  {
    cudaStreamCreate(&streams[i]);
  }

  //allocate device memory (one region per stream)
  u_int8_t* d_data = NULL;
  cutilSafeCall(cudaMalloc((void**)&d_data, NUM_STREAMS*DATA_SIZE));

  //read system memory usage before the test (in bytes - ensure nothing else is actively running during the test)
  system("free -b");

  //repeat N times
  for (u_int32_t i = 0; i < loopCount; i++)
  {
    //perform async memory transfer, kernel execution, and copy back
    //the data is irrelevant: the kernel is a no-op and the input is simply cloned back as output
    for (u_int32_t j = 0; j < NUM_STREAMS; j++)
    {
      cudaMemcpyAsync(d_data + j*DATA_SIZE, h_data + j*DATA_SIZE, DATA_SIZE, cudaMemcpyHostToDevice, streams[j]);
      cutilCheckMsg("memcpy host to device failed");
    }

    for (u_int32_t j = 0; j < NUM_STREAMS; j++)
    {
      //invoke the null kernel; the grid/block dimensions are irrelevant
      myKernel<<<100, 512, 0, streams[j]>>>();
      cutilCheckMsg("kernel execution failed");
    }

    for (u_int32_t j = 0; j < NUM_STREAMS; j++)
    {
      cudaMemcpyAsync(h_data + j*DATA_SIZE, d_data + j*DATA_SIZE, DATA_SIZE, cudaMemcpyDeviceToHost, streams[j]);
      cutilCheckMsg("memcpy device to host failed");
    }
  }

  cutilSafeCall(cudaThreadSynchronize());

  //read memory usage after the run
  system("free -b");

  //host memory allocated with cudaMallocHost must be released with cudaFreeHost
  cudaFreeHost(h_data);
  cudaFree(d_data);
  cudaThreadExit();
}

void runTestSync(int argc, char** argv)
{
  cudaSetDevice(cutGetMaxGflopsDeviceId());

  int loopCount = 10000;
  if (argc > 1)
  {
    loopCount = atoi(argv[1]);
  }

  const u_int32_t DATA_SIZE = 512*1024; //512KB, although the size doesn't actually matter for this test

  //allocate page-locked host memory
  u_int8_t* h_data = NULL;
  cutilSafeCall(cudaMallocHost((void**)&h_data, DATA_SIZE));

  //allocate device memory
  u_int8_t* d_data = NULL;
  cutilSafeCall(cudaMalloc((void**)&d_data, DATA_SIZE));

  //read system memory usage before the test (in bytes - ensure nothing else is actively running during the test)
  system("free -b");

  //repeat N times
  for (u_int32_t i = 0; i < loopCount; i++)
  {
    //perform sync memory transfer, kernel execution, and copy back
    //the data is irrelevant: the kernel is a no-op and the input is simply cloned back as output
    cudaMemcpy(d_data, h_data, DATA_SIZE, cudaMemcpyHostToDevice);
    cutilCheckMsg("memcpy host to device failed");

    //invoke the null kernel; the grid/block dimensions are irrelevant
    myKernel<<<100, 512, 0>>>();
    cutilCheckMsg("kernel execution failed");

    cudaMemcpy(h_data, d_data, DATA_SIZE, cudaMemcpyDeviceToHost);
    cutilCheckMsg("memcpy device to host failed");
  }

  cutilSafeCall(cudaThreadSynchronize());

  //read memory usage after the run
  system("free -b");

  //host memory allocated with cudaMallocHost must be released with cudaFreeHost
  cudaFreeHost(h_data);
  cudaFree(d_data);
  cudaThreadExit();
}

//The leak shows up only on CUDA 3.0 w/ driver 195.36.15 and NOT on CUDA 2.3 w/ driver 190.18

Michael.

Already fixed in the next driver release. Thanks for the very useful bug report, though!

Not a problem, just so I can keep my eye out - what is the next driver release version number going to be and what is the expected release date?
Thanks for the prompt reply - I would have had the report up sooner but it took a day to receive my account activation email!
I'll be sure to post anything else I find, although I must say this is the first problem I've had in two years of using CUDA (other than examples in the documentation not being right - but those suffer from exactly the same problem as my sample code here: transposition!).
Thanks again.

Michael.
