Memory leak in TensorRT 6?

Linux distro and version: Ubuntu 16.04
GPU type: TITAN Xp
nvidia driver version: 418.87.00
CUDA version: 10.1
CUDNN version: 7.6.3
TensorRT version: 6.0.1.5

After recently upgrading to TensorRT 6, we’ve been noticing memory leak warnings which didn’t appear in TensorRT 5.

Leak reports take the following form (from Valgrind’s Memcheck):

==23092== 2,408 bytes in 1 blocks are definitely lost in loss record 2,252 of 2,645
==23092==    at 0x402DE03: malloc (vg_replace_malloc.c:299)
==23092==    by 0x10E03D03: ??? (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x1059B0C7: nvinfer1::rt::cuda::WinogradConvActRunner::updateConvolution(dit::Convolution*, nvinfer1::rt::CommonContext const&, signed char const*, nvinfer1::utils::TensorLayout const&, nvinfer1::utils::TensorLayout const&) const (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x1059B26A: nvinfer1::rt::cuda::WinogradConvActRunner::recomputeResources(nvinfer1::rt::CommonContext const&) (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x1071E509: nvinfer1::rt::SafeEngine::initialize(nvinfer1::rt::CommonContext&, std::vector<nvinfer1::rt::EngineLayerAttribute, std::allocator<nvinfer1::rt::EngineLayerAttribute> > const&) (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x10535688: nvinfer1::rt::Engine::initialize(std::vector<nvinfer1::rt::EngineLayerAttribute, std::allocator<nvinfer1::rt::EngineLayerAttribute> > const&) (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x107093AA: ??? (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x1070ABDC: nvinfer1::builder::buildEngine(nvinfer1::NetworkBuildConfig&, nvinfer1::builder::EngineBuildContext const&, nvinfer1::Network const&) (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x105CCA2A: nvinfer1::builder::Builder::buildInternal(nvinfer1::NetworkBuildConfig&, nvinfer1::builder::EngineBuildContext const&, nvinfer1::Network const&) (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)
==23092==    by 0x105CD909: nvinfer1::builder::Builder::buildEngineWithConfig(nvinfer1::INetworkDefinition&, nvinfer1::IBuilderConfig&) (in /home/tom/projects/wraw/build/private/libnvinfer.so.6)

We’re careful to destroy all TensorRT objects by wrapping them in smart pointers that call destroy() when they go out of scope. In particular, I’m sure that every TensorRT object involved in the trace above (the builder, the builder config, the network definition, and the engine) has had destroy() called on it. We also use plugin layers, but the leaked allocation shown above points into TensorRT, not our code (we never call raw malloc anyway, and this was a debug build). Who is responsible for freeing the above memory?
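For reference, the wrapper is essentially a std::unique_ptr with a custom deleter; a minimal sketch of the pattern (TrtDestroyer and TrtUniquePtr are our own names, not part of the TensorRT API):

#include <memory>
#include "NvInfer.h"

// Deleter that calls destroy() on any TensorRT-owned object when the
// owning smart pointer goes out of scope.
struct TrtDestroyer
{
    template <typename T>
    void operator()(T* obj) const
    {
        if (obj) obj->destroy();
    }
};

template <typename T>
using TrtUniquePtr = std::unique_ptr<T, TrtDestroyer>;

// Every TensorRT object is wrapped as soon as it is created, e.g.:
//   TrtUniquePtr<nvinfer1::IBuilder> builder(nvinfer1::createInferBuilder(logger));
//   TrtUniquePtr<nvinfer1::IBuilderConfig> config(builder->createBuilderConfig());
//   TrtUniquePtr<nvinfer1::INetworkDefinition> network(builder->createNetworkV2(0U));
//   TrtUniquePtr<nvinfer1::ICudaEngine> engine(
//       builder->buildEngineWithConfig(*network, *config));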

I’m attempting to create a minimal repro for you, but it could be tricky (we use the network builder interface, use Plugin layers, wrap TensorRT quite a bit, and have proprietary models).

Regards,
Tom Peters

Hi,

We have added a fix for WinogradConvActRunner in TRT 7.0.
Could you please try TRT 7.0?

Thanks

Hi SunilJB,

Thanks for the response. We’ll give that a try.

Regards,
Tom

I am facing the same issue in TRT 6; however, I cannot move to TRT 7 because I’m developing for the Jetson Xavier. Is there a JetPack release with these fixes somewhere?

@Tom, did you find it resolved in TRT 7?

Best,

Hi Federico,

Unfortunately, I haven’t had access to an NVIDIA GPU in months, so I haven’t been able to verify their claim.

Good luck,
Tom

Has this issue been resolved?
I’m also hitting it with TensorRT 7.
GPU: Tesla P4
Ubuntu 16.04
CUDA version: 9.0
CUDNN version: 7.5.1
TensorRT version: 7.0.0.11

==12591== 119,832 bytes in 111 blocks are definitely lost in loss record 1,829 of 1,935
==12591==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12591==    by 0x740B753: ??? (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)
==12591==    by 0x6BA498F: nvinfer1::rt::cuda::WinogradConvActRunner::updateConvolution(dit::Convolution*, nvinfer1::rt::CommonContext const&, signed char const*, nvinfer1::utils::TensorLayout const&, nvinfer1::utils::TensorLayout const&) const (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)
==12591==    by 0x6BA4C1C: nvinfer1::rt::cuda::WinogradConvActRunner::recomputeResources(nvinfer1::rt::CommonContext const&, nvinfer1::rt::ExecutionParameters*) (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)
==12591==    by 0x6D7A732: nvinfer1::rt::SafeEngine::initialize(nvinfer1::rt::CommonContext&, std::vector<nvinfer1::rt::EngineLayerAttribute, std::allocator<nvinfer1::rt::EngineLayerAttribute> > const&) (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)
==12591==    by 0x6B2D7ED: nvinfer1::rt::Engine::initialize(std::vector<nvinfer1::rt::EngineLayerAttribute, std::allocator<nvinfer1::rt::EngineLayerAttribute> > const&) (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)
==12591==    by 0x6B2E52B: nvinfer1::rt::Engine::deserialize(void const*, unsigned long, nvinfer1::IGpuAllocator&, nvinfer1::IPluginFactory*) (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)
==12591==    by 0x6B36A08: nvinfer1::Runtime::deserializeCudaEngine(void const*, unsigned long, nvinfer1::IPluginFactory*) (in /ljay/workspace/tools/nvidia/tensorrt/TensorRT-7.0.0.11/targets/x86_64-linux-gnu/lib/libnvinfer.so.7.0.0)

I can observe GPU memory usage growing every iteration until the process eventually hits OOM; the numbers below are sampled roughly as sketched after the log.

---------- iter 0: ----------
Memory used: 123
---------- iter 1: ----------
Memory used: 769
---------- iter 2: ----------
Memory used: 921
---------- iter 3: ----------
Memory used: 1073
...
...
---------- iter 43: ----------
Memory used: 7207
---------- iter 44: ----------
Memory used: 7361
---------- iter 45: ----------
Memory used: 7513
terminate called after throwing an instance of 'std::runtime_error'
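A simplified sketch of that sampling, built on cudaMemGetInfo (it reports MiB in use across the whole device, so it is not TensorRT-specific; my actual logging code differs slightly):

#include <cstdio>
#include <cuda_runtime_api.h>

// Print the GPU memory currently in use on the active device, in MiB.
// This is a logging helper, not part of TensorRT.
static void printMemoryUsed(int iter)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::printf("---------- iter %d: ----------\n", iter);
    std::printf("Memory used: %zu\n", (totalBytes - freeBytes) / (1024 * 1024));
}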

@SunilJB after upgrading to TensorRT 7, I no longer detect this leak. Thank you.

@microlj I can’t speak to your issue. Do you see the same problem in TensorRT 6 (assuming you’re able to run there)? Are you sure you call destroy() on every object owned by TensorRT?
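For the deserialization path in your trace, I’d expect the teardown to look roughly like this (a sketch against the TRT 7 API; runOnce, blob, and blobSize are placeholder names):

#include <cstddef>
#include "NvInfer.h"

// Deserialize an engine, use it, then destroy every TensorRT-owned object
// in reverse order of creation.
void runOnce(nvinfer1::ILogger& logger, const void* blob, std::size_t blobSize)
{
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(blob, blobSize, nullptr);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // ... enqueue inference and synchronize here ...

    context->destroy();
    engine->destroy();
    runtime->destroy();
}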

I haven’t tried TRT 6.
I hit this issue on a Tesla P4. I tried a Tesla T4 today with the same code, and there’s no memory leak. Weird… Anyway, I’m checking my code.