Hi there,
I have noticed a memory leak on my test platform whereby free system memory (as shown in top) steadily decreases while my CUDA application is active. However, the memory lost is NOT attributed to the process in question.
To cut a long investigative story short, the points below summarise the conditions under which the leak exhibits itself and those under which it does not. A full source-code example is provided at the bottom for anyone wishing to test/replicate this issue in their own environment; however, I must stress that the code was transcribed from a standalone server and may therefore contain typos or compilation errors - my apologies if it does.
Note: All tests were performed on a CentOS 5.4 server with a GTX 285 GPU (the leak has also been observed on an identically kickstarted Red Hat 5.4 installation).
Memory leak occurs when all of the following conditions are met:
- cudaMemcpyAsync is used (streamed)
- the direction of the copy is cudaMemcpyHostToDevice
- NVIDIA driver version 195.36.15 or 256.40 is used with CUDA 3.0 or 3.1 (see the note in the section below regarding other driver versions tested)
Memory leak does not occur if any of the following statements are true:
- cudaMemcpy is used (synchronous / un-streamed)
- the direction of the copy is cudaMemcpyDeviceToHost
- NVIDIA driver version 190.18 is used with CUDA 2.3 (note: 190.18, 195.36.15 and 256.40 were the only drivers I had available to test with - this is not an exhaustive list)
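In other words, the minimal distinguishing pair looks like this (pointer names are placeholders; assume h_ptr is page-locked host memory and d_ptr is device memory):
//leaks on the affected driver + CUDA combinations:
cudaMemcpyAsync(d_ptr, h_ptr, size, cudaMemcpyHostToDevice, stream);
//neither of these leaks:
cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice); //synchronous copy
cudaMemcpyAsync(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost, stream); //opposite direction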
The test setup can be seen in the sample code, but is summarised below for quick reference:
1) a block of page-locked host memory is allocated for each stream (note: this memory is uninitialised, as its contents are irrelevant for the purposes of this test)
2) the two streams are created
3) device memory is allocated for each stream
4) the following three steps are repeated in a loop of order N:
4a) copy the input memory for stream[i] to the device memory of stream[i] using cudaMemcpyAsync (repeat for each stream)
4b) run a kernel on each stream - the kernel is an empty function which does nothing (it doesn't need to be there, but provides an example of how one could double-buffer input to increase throughput)
4c) copy the same memory back from the device to the host (again, repeat for each stream)
Variables with NO impact:
Data Size: the size of the memory block being copied has no impact on BytesLeakedPerPass (this value can be divided by the number of streams to give BytesLeakedPerAsyncCall).
Stream Creation: calling cudaStreamCreate and cudaStreamDestroy inside the loop, rather than once outside it, does not affect the leak (a sketch of this variant is shown below).
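For completeness, the in-loop variant looked roughly like this - a single stream is shown for brevity, and this is an illustrative sketch rather than the exact code I ran:
for (u_int32_t i = 0; i < loopCount; i++)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream); //create the stream afresh on every pass
    cudaMemcpyAsync(d_data, h_data, DATA_SIZE, cudaMemcpyHostToDevice, stream);
    myKernel<<<100, 512, 0, stream>>>();
    cudaMemcpyAsync(h_data, d_data, DATA_SIZE, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream); //drain the stream before destroying it
    cudaStreamDestroy(stream); //destroying the stream does not reclaim the leaked memory
}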
The results of the tests I conducted are given below:
Test | Memory size of data | Number of passes | BytesLeakedPerPass
1    | 512KB               | 65536            | 244
2    | 512KB               | 65536            | 235
3    | 512KB               | 65536            | 241
4    | 128KB               | 65536            | 222
5    | 128KB               | 65536            | 243
6    | 128KB               | 65536            | 241
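For clarity, the per-pass figure was derived from the 'free -b' readings taken before and after each run, roughly as follows (variable names are illustrative):
bytesLeakedPerPass = (freeBefore - freeAfter) / loopCount; //freeBefore/freeAfter taken from 'free -b'
bytesLeakedPerAsyncCall = bytesLeakedPerPass / NUM_STREAMS; //~240 / 2 streams = ~120 bytes per async host-to-device copy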
It is perhaps obvious, but worth stating: if you wrap the entire program in a bash loop (while true; do ./cuda_test 1; done) the leak does not occur - since initialisation, startup and teardown happen on every iteration, this is expected.
That pretty much sums up what I've managed to uncover about this issue.
I hope that this information is of use and allows you to identify and resolve the issue. Thankfully, in my environment I do not need CUDA 3.x features and can safely use 190.18 with CUDA 2.3; however, I'm sure there are many others who aren't so fortunate - particularly if this is deployed in (pseudo) mission-critical environments.
Thanks for your time; the source code is below (again, apologies if it contains typos).
//Full code listing of sample program which exhibits the bug:
//Note: Code taken from simpleTexture example so some includes may not be required.
//Note: Code transcribed from a standalone server, typos may be present.
//copy this file into the ${CUDA_SDK_ROOT}/C/src/cuda_mem_test directory to compile
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/types.h> //for the u_int8_t / u_int32_t typedefs
#include <cutil_inline.h>
//prototypes
void runTestAsync(int argc, char** argv);
void runTestSync(int argc, char** argv);
//kernel
__global__ void
myKernel()
{
}
int main(int argc, char** argv)
{
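//call runTestSync(argc, argv) instead to exercise the synchronous (non-leaking) path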
runTestAsync(argc, argv);
cutilExit(argc, argv);
}
void runTestAsync(int argc, char** argv)
{
cudaSetDevice(cutGetMaxGflopsDeviceId());
int loopCount = 10000;
if (argc > 1)
{
loopCount = atoi(argv[1]);
}
const u_int32_t NUM_STREAMS = 2;
const u_int32_t DATA_SIZE = 512*1024; //512KB memory size, although this doesn't actually matter for this test
//allocate some host memory space
u_int8_t* h_data = NULL;
cutilSafeCall(cudaMallocHost((void**)&h_data, NUM_STREAMS*DATA_SIZE));
//create two streams
cudaStream_t streams[NUM_STREAMS];
for (u_int32_t i = 0; i < NUM_STREAMS; i++)
{
cudaStreamCreate(&streams[i]);
}
//allocate some device memory space
u_int8_t* d_data = NULL;
cutilSafeCall(cudaMalloc((void**)&d_data, NUM_STREAMS*DATA_SIZE));
//read system memory usage before test (in bytes - ensure nothing else is actively running when doing this test)
system("free -b");
//repeat N times
for (u_int32_t i = 0; i < loopCount; i++)
{
//perform async memory transfer, kernel execution, and copy back
//in this case the data is irrelevant, kernel consists of noop and input is cloned back as output
for (u_int32_t j = 0; j < NUM_STREAMS; j++)
{
cudaMemcpyAsync(d_data + j*DATA_SIZE, h_data + j*DATA_SIZE, DATA_SIZE, cudaMemcpyHostToDevice, streams[j]);
cutilCheckMsg("memcpy host to device failed");
}
for (u_int32_t j = 0; j < NUM_STREAMS; j++)
{
//invoke null kernel, first two parameters are irrelevant
myKernel<<<100, 512, 0, streams[j]>>>();
cutilCheckMsg("kernel execution failed");
}
for (u_int32_t j = 0; j < NUM_STREAMS; j++)
{
cudaMemcpyAsync(h_data + j*DATA_SIZE, d_data + j*DATA_SIZE, DATA_SIZE, cudaMemcpyDeviceToHost, streams[j]);
cutilCheckMsg("memcpy device to host failed");
}
}
cutilSafeCall(cudaThreadSynchronize());
//read memory usage after run
system("free -b");
cudaFreeHost(h_data); //note: memory from cudaMallocHost must be freed with cudaFreeHost, not cudaFree
cudaFree(d_data);
cudaThreadExit();
}
void runTestSync(int argc, char** argv)
{
cudaSetDevice(cutGetMaxGflopsDeviceId());
int loopCount = 10000;
if (argc > 1)
{
loopCount = atoi(argv[1]);
}
const u_int32_t DATA_SIZE = 512*1024; //512KB memory size, although this doesn't actually matter for this test
//allocate some host memory space
u_int8_t* h_data = NULL;
cutilSafeCall(cudaMallocHost((void**)&h_data, DATA_SIZE));
//allocate some device memory space
u_int8_t* d_data = NULL;
cutilSafeCall(cudaMalloc((void**)&d_data, DATA_SIZE));
//read system memory usage before test (in bytes - ensure nothing else is actively running when doing this test)
system("free -b");
//repeat N times
for (u_int32_t i = 0; i < loopCount; i++)
{
//perform sync memory transfer, kernel execution, and copy back
//in this case the data is irrelevant, kernel consists of noop and input is cloned back as output
cudaMemcpy(d_data, h_data, DATA_SIZE, cudaMemcpyHostToDevice);
cutilCheckMsg("memcpy host to device failed");
//invoke null kernel, first two parameters are irrelevant
myKernel<<<100, 512, 0>>>();
cutilCheckMsg("kernel execution failed");
cudaMemcpy(h_data, d_data, DATA_SIZE, cudaMemcpyDeviceToHost);
cutilCheckMsg("memcpy device to host failed");
}
cutilSafeCall(cudaThreadSynchronize());
//read memory usage after run
system("free -b");
cudaFreeHost(h_data); //as above, pinned memory must be freed with cudaFreeHost
cudaFree(d_data);
cudaThreadExit();
}
//Reminder: the leak manifests only on CUDA 3.0 w/ driver 195.36.15 (and CUDA 3.1 w/ 256.40), and NOT on CUDA 2.3 w/ driver 190.18.
Michael.