Memory leak for CUDA runtime lib

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
[*] DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
[*] Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
[*] other

SDK Manager Version
1.9.3.10904
[*] other

Host Machine Version
[*] native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

I’m using the CUDA runtime library and I’m seeing a memory leak.

The code is as follows:

#include <cuda_runtime.h>
#include <iostream>

int main(int argc, char** argv) {
    cudaSetDevice(0);
    {
        cudaStream_t stream = nullptr;
        cudaStreamCreate(&stream);
        cudaStreamDestroy(stream);
    }
    cudaDeviceSynchronize();
    cudaError_t cuda_error = cudaDeviceReset();
    if (cuda_error != cudaSuccess) {
        std::cerr << "cudaDeviceReset Error: " << cuda_error;
    }
    return 0;
}

Test command:
/opt/data/haihuawei/valgrind/bin/valgrind --tool=memcheck --leak-check=full --log-file=valgrind.log.txt ./cuda_test
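If these one-time allocations inside the driver stack turn out to be benign, they can be silenced with a Valgrind suppression file so that real application leaks stand out. A minimal sketch, matching the frames in the log below (the rule name is arbitrary and the `obj:` patterns may need adjusting to your library paths):

```
# cuda.supp — suppress the one-time allocation reported under libnvcucompat.so
{
   cuda_driver_init_leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:*/libnvcucompat.so
   obj:*/libcuda.so*
   ...
}
```

Pass it with `--suppressions=cuda.supp` on the Valgrind command line; suppressed records are then counted in the "suppressed:" line of the leak summary instead of being reported.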

log:
==311760== 8 bytes in 1 blocks are definitely lost in loss record 23 of 1,475
==311760== at 0x484B828: malloc (vg_replace_malloc.c:442)
==311760== by 0x2E98EC4F: ??? (in /usr/lib/libnvcucompat.so)
==311760== by 0x74EEBD7: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x7516CE7: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x74EF043: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x7371717: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x741E50B: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x750C72B: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x878BFBF: ??? (in /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.291)
==311760== by 0x878EB93: ??? (in /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.291)
==311760== by 0x8E8ADF3: __pthread_once_slow (pthread_once.c:116)
==311760== by 0x87D3613: ??? (in /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.291)
==311760==
==311760== 12 bytes in 1 blocks are definitely lost in loss record 25 of 1,475
==311760== at 0x484B828: malloc (vg_replace_malloc.c:442)
==311760== by 0x2E98EC2F: ??? (in /usr/lib/libnvcucompat.so)
==311760== by 0x74EEBD7: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x7516CE7: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x74EF043: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x7371717: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x741E50B: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x750C72B: ??? (in /usr/lib/libcuda.so.1)
==311760== by 0x878BFBF: ??? (in /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.291)
==311760== by 0x878EB93: ??? (in /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.291)
==311760== by 0x8E8ADF3: __pthread_once_slow (pthread_once.c:116)
==311760== by 0x87D3613: ??? (in /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.291)
==311760==

Dear @haihua.wei,
Below is the output observed on my machine.

nvidia@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/simpleCUDATest$ cat simpleCUDATest.cu
#include <cuda_runtime.h>

// includes, project
#include <helper_cuda.h>
#include <helper_functions.h>
int main(int argc, char** argv)
{
cudaSetDevice(0);
{
cudaStream_t stream=nullptr;
cudaStreamCreate(&stream);
cudaStreamDestroy(stream);
}
cudaDeviceSynchronize();
cudaError cuda_error=cudaDeviceReset();
if(cuda_error!=cudaSuccess){
std::cerr<< "cudaDeviceReset Error :"<<cuda_error;
}
return 0;
}

nvidia@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/simpleCUDATest$ valgrind --tool=memcheck --leak-check=full --log-file=/home/nvidia/valgrind.log.txt ./simpleCUDATest
Illegal instruction
nvidia@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/simpleCUDATest$ cat ~/valgrind.log.txt
==10016== Memcheck, a memory error detector
==10016== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==10016== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==10016== Command: ./simpleCUDATest
==10016== Parent PID: 9795
==10016==
ARM64 front end: load_store
disInstr(arm64): unhandled instruction 0xB8A18001
disInstr(arm64): 1011'1000 1010'0001 1000'0000 0000'0001
==10016== valgrind: Unrecognised instruction at address 0x53d8398.
==10016==    at 0x53D8398: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x543F713: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x543FD43: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x53D34D3: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x13868B: __cudart106 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x1387EB: __cudart917 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x48A23B7: __pthread_once_slow (pthread_once.c:116)
==10016==    by 0x187C17: __cudart1189 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x12EDB3: __cudart104 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x156447: cudaSetDevice (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x11140B: main (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016== Your program just tried to execute an instruction that Valgrind
==10016== did not recognise.  There are two possible reasons for this.
==10016== 1. Your program has a bug and erroneously jumped to a non-code
==10016==    location.  If you are running Memcheck and you just saw a
==10016==    warning about a bad jump, it's probably your program's fault.
==10016== 2. The instruction is legitimate but Valgrind doesn't handle it,
==10016==    i.e. it's Valgrind's fault.  If you think this is the case or
==10016==    you are not sure, please let us know and we'll try to fix it.
==10016== Either way, Valgrind will now raise a SIGILL signal which will
==10016== probably kill your program.
==10016==
==10016== Process terminating with default action of signal 4 (SIGILL)
==10016==  Illegal opcode at address 0x53D8398
==10016==    at 0x53D8398: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x543F713: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x543FD43: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x53D34D3: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x13868B: __cudart106 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x1387EB: __cudart917 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x48A23B7: __pthread_once_slow (pthread_once.c:116)
==10016==    by 0x187C17: __cudart1189 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x12EDB3: __cudart104 (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x156447: cudaSetDevice (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==    by 0x11140B: main (in /usr/local/cuda-11.4/samples/0_Simple/simpleCUDATest/simpleCUDATest)
==10016==
==10016== HEAP SUMMARY:
==10016==     in use at exit: 80,990 bytes in 56 blocks
==10016==   total heap usage: 81 allocs, 25 frees, 292,782 bytes allocated
==10016==
==10016== 56 bytes in 1 blocks are possibly lost in loss record 7 of 18
==10016==    at 0x4849D8C: malloc (in /usr/lib/aarch64-linux-gnu/valgrind/vgpreload_memcheck-arm64-linux.so)
==10016==    by 0x51FCB8F: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x51F27F3: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x51DC1AB: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x400E8B3: call_init.part.0 (dl-init.c:72)
==10016==    by 0x400E9B3: call_init (dl-init.c:30)
==10016==    by 0x400E9B3: _dl_init (dl-init.c:119)
==10016==    by 0x4BC620B: _dl_catch_exception (dl-error-skeleton.c:182)
==10016==    by 0x4012A13: dl_open_worker (dl-open.c:758)
==10016==    by 0x4BC61AB: _dl_catch_exception (dl-error-skeleton.c:208)
==10016==    by 0x40121A3: _dl_open (dl-open.c:837)
==10016==    by 0x48C409B: dlopen_doit (dlopen.c:66)
==10016==    by 0x4BC61AB: _dl_catch_exception (dl-error-skeleton.c:208)
==10016==
==10016== 336 bytes in 6 blocks are possibly lost in loss record 14 of 18
==10016==    at 0x4849D8C: malloc (in /usr/lib/aarch64-linux-gnu/valgrind/vgpreload_memcheck-arm64-linux.so)
==10016==    by 0x51FCB8F: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x51F2753: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x51DC1AB: ??? (in /usr/lib/libcuda.so.1)
==10016==    by 0x400E8B3: call_init.part.0 (dl-init.c:72)
==10016==    by 0x400E9B3: call_init (dl-init.c:30)
==10016==    by 0x400E9B3: _dl_init (dl-init.c:119)
==10016==    by 0x4BC620B: _dl_catch_exception (dl-error-skeleton.c:182)
==10016==    by 0x4012A13: dl_open_worker (dl-open.c:758)
==10016==    by 0x4BC61AB: _dl_catch_exception (dl-error-skeleton.c:208)
==10016==    by 0x40121A3: _dl_open (dl-open.c:837)
==10016==    by 0x48C409B: dlopen_doit (dlopen.c:66)
==10016==    by 0x4BC61AB: _dl_catch_exception (dl-error-skeleton.c:208)
==10016==
==10016== LEAK SUMMARY:
==10016==    definitely lost: 0 bytes in 0 blocks
==10016==    indirectly lost: 0 bytes in 0 blocks
==10016==      possibly lost: 392 bytes in 7 blocks
==10016==    still reachable: 80,598 bytes in 49 blocks
==10016==         suppressed: 0 bytes in 0 blocks
==10016== Reachable blocks (those to which a pointer was found) are not shown.
==10016== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==10016==
==10016== For lists of detected and suppressed errors, rerun with: -s
==10016== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

I don’t see any "definitely lost" messages in the log.

@SivaRamaKrishnaNV What DRIVE OS and CUDA environment is your machine running? Can you give us a reference?

Dear @haihua.wei,
Yes. I quickly tested on the DRIVE AGX Orin platform with DRIVE OS (latest internal release).
I will try with the recent DevZone release and update you with the results.

@SivaRamaKrishnaNV This has been a huge help to us. Thank you so much.

Dear @haihua.wei,
How about using https://docs.nvidia.com/cuda/compute-sanitizer/index.html ?
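As a sketch of what that would look like: Compute Sanitizer ships with the CUDA toolkit and its memcheck tool has a leak-check mode that understands CUDA allocations, unlike Valgrind, which also trips over unrecognized instructions in libcuda.so as seen above. A possible invocation (assuming `compute-sanitizer` is on your PATH; see the linked docs for the full option list):

```shell
# Run the memcheck tool with full leak checking against the repro binary
compute-sanitizer --tool memcheck --leak-check full ./cuda_test
```

This reports unfreed `cudaMalloc`/stream/context resources at exit rather than host-side heap blocks, so it is a better fit for deciding whether the CUDA runtime itself is leaking.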

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.