Memory leak running CUDA C program

I have discovered a memory leak while running a CUDA C program, and I can reproduce the problem with a very simple test case.
Because the program will continuously receive different data, I want the CUDA process to keep running indefinitely.
In each loop iteration, I copy the data to the GPU and then run several kernels on it.

I compile the code using the following command:
nvcc -o test2 test.cu

While the program runs, the memory used by the process keeps increasing. Why does this happen?

The version of CUDA:
ba25b5034b9519a41a1f1b7713ce8cd

Hi,

We will give it a try and provide more info to you later.
Thanks.

Hi,

Could you attach the test.cu source so we can give it a try?
Thanks.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

__global__ void Multip(float *a, float *b, float *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    c[idx] = a[idx] * b[idx];
}

__global__ void Log(float *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    c[idx] = log10f(c[idx] * 5);
}

int main() {
    int n = 512000;
    int i = 0;
    float *a, *b;
    float *d_a, *d_b, *d_c;

    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, n * sizeof(float));

    a = (float *)malloc(n * sizeof(float));
    b = (float *)malloc(n * sizeof(float));

    for (i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i + 1;
    }

    while (1) {
        cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice);
        Multip<<<1000, 512>>>(d_a, d_b, d_c);
        Log<<<1000, 512>>>(d_c);
        usleep(200000);
    }

    /* Never reached while the loop above runs forever. */
    free(a);
    free(b);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
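One thing worth noting while debugging a suspected leak: the loop above never checks any CUDA return codes, so a failed copy or kernel launch would go unnoticed. Below is a sketch of the same loop body with error checking and a cudaDeviceSynchronize() added; the runtime API calls (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard CUDA, but the CHECK macro is our own illustrative helper, not part of any library.

```
// CHECK is an illustrative helper macro, not part of the CUDA runtime.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Inside the while (1) loop:
    CHECK(cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice));
    Multip<<<1000, 512>>>(d_a, d_b, d_c);
    CHECK(cudaGetLastError());          // catches launch-configuration failures
    Log<<<1000, 512>>>(d_c);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());     // surfaces asynchronous kernel errors
    usleep(200000);
```

Synchronizing once per iteration also keeps the host from queueing launches faster than the GPU consumes them, which can otherwise look like slowly growing host memory.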

Hi,

We are not able to reproduce this issue in our environment.
Could you double-check on your side as well?

Thanks.

Hi,

We tested this again and found that memory usage does increase a little at the beginning.
After that, it remains stable for hours, so this appears to be expected behavior rather than a bug.
Could you confirm whether you see similar behavior on your side?

Start:

$ ps aux | grep test2
nvidia      6233  0.8  0.3 8037028 26876 pts/0   Sl+  02:53   0:00 ./test2
nvidia      6276  0.0  0.0   9016  1840 pts/1    S+   02:53   0:00 grep --color=auto test2
$ sudo tegrastats 
12-10-2024 02:54:07 RAM 1510/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@43.5C soc2@42.843C soc0@40.562C gpu@43.5C tj@43.5C soc1@42.062C VDD_IN 4654mW/4654mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW
12-10-2024 02:54:08 RAM 1510/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@43.343C soc2@42.843C soc0@40.656C gpu@43.625C tj@43.625C soc1@42.25C VDD_IN 4654mW/4654mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW

After 1 hour:

$ ps aux | grep test2
nvidia      6233  0.3  0.4 8037028 31232 pts/0   Sl+  02:53   0:13 ./test2
nvidia      6404  0.0  0.0   9016  1852 pts/1    S+   03:55   0:00 grep --color=auto test2
$ sudo tegrastats 
12-10-2024 03:55:16 RAM 1520/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@44.843C soc2@44.312C soc0@42.156C gpu@44.906C tj@44.906C soc1@43.687C VDD_IN 4654mW/4654mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW
12-10-2024 03:55:17 RAM 1520/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@45C soc2@44.375C soc0@42.218C gpu@45.031C tj@45C soc1@43.468C VDD_IN 4654mW/4654mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW

After 2 hours:

$ ps aux | grep test2
nvidia      6233  0.3  0.4 8037028 31232 pts/0   Sl+  02:53   0:25 ./test2
nvidia      6477  0.0  0.0   9016  1852 pts/1    S+   04:56   0:00 grep --color=auto test2
$ sudo tegrastats  
12-10-2024 04:56:24 RAM 1523/7620MB (lfb 5x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@45.062C soc2@44.281C soc0@41.906C gpu@44.906C tj@45.062C soc1@43.468C VDD_IN 4654mW/4654mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW
12-10-2024 04:56:25 RAM 1523/7620MB (lfb 5x4MB) SWAP 0/3810MB (cached 0MB) CPU [3%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@44.812C soc2@44.187C soc0@42.062C gpu@44.812C tj@44.812C soc1@43.718C VDD_IN 4694mW/4674mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW

After 3 hours

$ ps aux | grep test2
nvidia      6233  0.3  0.4 8037028 31232 pts/0   Sl+  02:53   0:39 ./test2
nvidia      6567  0.0  0.0   9016  1844 pts/1    S+   06:09   0:00 grep --color=auto test2
$ sudo tegrastats 
12-10-2024 06:09:59 RAM 1521/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,1%@1510,0%@1510,0%@1510] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@44.968C soc2@44.312C soc0@42.156C gpu@45.281C tj@45.281C soc1@43.406C VDD_IN 4654mW/4654mW VDD_CPU_GPU_CV 963mW/963mW VDD_SOC 1484mW/1484mW
12-10-2024 06:10:00 RAM 1521/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1510,0%@1510,0%@1510,0%@1510,0%@1510,0%@1510]

Thanks.

Yes, it’s the same here. The RSS of the test program increased after startup.
After about half an hour, the RSS stopped increasing and remained unchanged.
I guess there is a maximum value, and that maximum grows as the program becomes more complex.
I’m wondering if there’s another reliable way to monitor memory changes, so I can check whether my original program has other memory problems.

Hi,

We also monitored system memory with tegrastats and observed no leak.
The initial increase can be attributed to internally allocated data structures, such as staging buffers.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.