Segmentation fault with multithreaded engine build

Linux distro and version: Ubuntu 16.04.6 LTS
GPU type: TITAN Xp
Nvidia driver version: 418.87.00
CUDA version: 10.1
CUDNN version: 7.6.3
TensorRT version: 6.0.1

According to my understanding of https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#thread-safety, it should be possible to build TensorRT engines concurrently from multiple threads. However, that is not what I see on the platform above. Included below is an example that intermittently segfaults on this platform. Note that, despite many hundreds of runs, I have not been able to reproduce the issue on another machine running Ubuntu 18.04.4 LTS and TensorRT 7.0.0. Finally, I have encountered crashes on Windows with TensorRT 6 that I suspect are related, but I have not run this specific example there, nor have I tried TensorRT 7 on Windows yet.

Will NVIDIA ever push patches to TensorRT 6, or is its development frozen (and all bugfixes put on TRT7 and beyond)?

What follows is a reproducing example (without model files – to share those, we’d need to be in touch), followed by gdb backtraces of the core dump.

Reproducing example:

#include <NvCaffeParser.h>
#include <NvInfer.h>

#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

namespace {
class Logger : public nvinfer1::ILogger {
public:
  void log(nvinfer1::ILogger::Severity severity, const char *msg) override {}
};
} // unnamed namespace

static ::Logger g_logger;

static void ConstructNetwork(nvcaffeparser1::ICaffeParser &parser,
                             nvinfer1::IBuilder &builder,
                             nvinfer1::IBuilderConfig &builderConfig,
                             nvinfer1::INetworkDefinition &network) {
  const nvcaffeparser1::IBlobNameToTensor *blobNameToTensor =
      parser.parse("models/model.proto", "models/model.caffemodel", network,
                   nvinfer1::DataType::kFLOAT);

  network.markOutput(*blobNameToTensor->find("ms_joint_maps"));
  network.markOutput(*blobNameToTensor->find("ms_bone_maps"));
  network.markOutput(*blobNameToTensor->find("seg_maps"));

  builderConfig.setMaxWorkspaceSize(1 << 30);
}

static auto BuildEngine() {
  auto builder = nvinfer1::createInferBuilder(g_logger);
  assert(builder != nullptr);
  auto builderConfig = builder->createBuilderConfig();
  assert(builderConfig != nullptr);
  auto network = builder->createNetworkV2(0U);
  assert(network != nullptr);
  auto parser = nvcaffeparser1::createCaffeParser();
  assert(parser != nullptr);

  ConstructNetwork(*parser, *builder, *builderConfig, *network);
  assert(network != nullptr);
  std::cout << "building cuda engine ... \n";
  auto engine = builder->buildEngineWithConfig(*network, *builderConfig);
  assert(engine != nullptr);

  return engine;
}

int main() {
  int const numEngines = 2;
  std::vector<std::thread> initThreads;

  for (int ix = 0; ix < numEngines; ++ix) {
    initThreads.emplace_back(BuildEngine);
  }

  for (auto &t : initThreads) {
    t.join();
  }

  return EXIT_SUCCESS;
}

GDB backtraces from a core-dump:

Program terminated with signal SIGSEGV, Segmentation fault.                                                                                                                                                                          
#0  __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:270
270     ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
[Current thread is 1 (Thread 0x7fceebfae700 (LWP 24414))]
(gdb) thread apply all bt

Thread 6 (Thread 0x7fcee9be7700 (LWP 24417)):
#0  0x00007fcf0bfc88c8 in accept4 (fd=8, addr=..., addr_len=0x7fcee9be6398, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
#1  0x00007fceea04f6fa in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fceea04199d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fceea050dc8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fcf0cb956ba in start_thread (arg=0x7fcee9be7700) at pthread_create.c:333
#5  0x00007fcf0bfc741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7fcf1a83f1c0 (LWP 24413)):
#0  0x00007fcf0cb9698d in pthread_join (threadid=140526699079424, thread_return=0x0) at pthread_join.c:90
#1  0x00007fcf0c87bac3 in std::thread::join() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00000000004010cd in main () at main.cpp:61

Thread 4 (Thread 0x7fcee93e6700 (LWP 24418)):
#0  0x00007fcf0bfbb74d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fceea04e733 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fceea0dd4dd in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fceea050dc8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fcf0cb956ba in start_thread (arg=0x7fcee93e6700) at pthread_create.c:333
#5  0x00007fcf0bfc741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7fcee8bdb700 (LWP 24419)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x00007fceea051b47 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fcee9ff7287 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fceea050dc8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fcf0cb956ba in start_thread (arg=0x7fcee8bdb700) at pthread_create.c:333
#5  0x00007fcf0bfc741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7fceeb7ad700 (LWP 24415)):
#0  0x00007fcf0c84fcb0 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fcf0c29a3ea in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007fcf0c8507b8 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fcf0c846fee in std::__throw_length_error(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fcf0c893999 in std::string::assign(char const*, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fcf0c88a0fd in std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::overflow(int) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fcf0c8e818a in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fcf0c8d7453 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fcf0d70ab1b in nvinfer1::builder::buildEngine(nvinfer1::NetworkBuildConfig&, nvinfer1::builder::EngineBuildContext const&, nvinfer1::Network const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#9  0x00007fcf0d5cca2b in nvinfer1::builder::Builder::buildInternal(nvinfer1::NetworkBuildConfig&, nvinfer1::builder::EngineBuildContext const&, nvinfer1::Network const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#10 0x00007fcf0d5cd90a in nvinfer1::builder::Builder::buildEngineWithConfig(nvinfer1::INetworkDefinition&, nvinfer1::IBuilderConfig&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#11 0x00000000004012cb in BuildEngine () at main.cpp:46
#12 0x0000000000402327 in std::__invoke_impl<nvinfer1::ICudaEngine*, nvinfer1::ICudaEngine* (*)()> (__f=@0x1deb268: 0x401110 <BuildEngine()>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60
#13 0x00000000004022bd in std::__invoke<nvinfer1::ICudaEngine* (*)()> (__fn=@0x1deb268: 0x401110 <BuildEngine()>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95
#14 0x0000000000402295 in std::thread::_Invoker<std::tuple<nvinfer1::ICudaEngine* (*)()> >::_M_invoke<0ul> (this=0x1deb268) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:244
#15 0x0000000000402265 in std::thread::_Invoker<std::tuple<nvinfer1::ICudaEngine* (*)()> >::operator() (this=0x1deb268) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:251
#16 0x000000000040213e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nvinfer1::ICudaEngine* (*)()> > >::_M_run (this=0x1deb260) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:195
#17 0x00007fcf0c87b890 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007fcf0cb956ba in start_thread (arg=0x7fceeb7ad700) at pthread_create.c:333
#19 0x00007fcf0bfc741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7fceebfae700 (LWP 24414)):
#0  __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:270
#1  0x00007fcf0c8e8158 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fcf0c8d7453 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fcf0d70ab1b in nvinfer1::builder::buildEngine(nvinfer1::NetworkBuildConfig&, nvinfer1::builder::EngineBuildContext const&, nvinfer1::Network const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#4  0x00007fcf0d5cca2b in nvinfer1::builder::Builder::buildInternal(nvinfer1::NetworkBuildConfig&, nvinfer1::builder::EngineBuildContext const&, nvinfer1::Network const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#5  0x00007fcf0d5cd90a in nvinfer1::builder::Builder::buildEngineWithConfig(nvinfer1::INetworkDefinition&, nvinfer1::IBuilderConfig&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#6  0x00000000004012cb in BuildEngine () at main.cpp:46
#7  0x0000000000402327 in std::__invoke_impl<nvinfer1::ICudaEngine*, nvinfer1::ICudaEngine* (*)()> (__f=@0x1de89a8: 0x401110 <BuildEngine()>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60
#8  0x00000000004022bd in std::__invoke<nvinfer1::ICudaEngine* (*)()> (__fn=@0x1de89a8: 0x401110 <BuildEngine()>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95
#9  0x0000000000402295 in std::thread::_Invoker<std::tuple<nvinfer1::ICudaEngine* (*)()> >::_M_invoke<0ul> (this=0x1de89a8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:244
#10 0x0000000000402265 in std::thread::_Invoker<std::tuple<nvinfer1::ICudaEngine* (*)()> >::operator() (this=0x1de89a8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:251
#11 0x000000000040213e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nvinfer1::ICudaEngine* (*)()> > >::_M_run (this=0x1de89a0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:195
#12 0x00007fcf0c87b890 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007fcf0cb956ba in start_thread (arg=0x7fceebfae700) at pthread_create.c:333
#14 0x00007fcf0bfc741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Hi Tom,

At a glance, the code seems fine. My only suspicion is this line in the docs: "If using multiple builder or runtime objects, use the same logger, and ensure that it is thread-safe." However, I’m assuming this thread-safety note only applies if you explicitly use the logger yourself for additional logging.

I’m looking into the issue.

As a side note, do you mind sharing your motivation to stay with TensorRT 6 and not move to TensorRT 7? This might help me to get a better answer than to just suggest upgrading.

Will NVIDIA ever push patches to TensorRT 6, or is its development frozen (and all bugfixes put on TRT7 and beyond)?

There are no plans on back-porting bug fixes to older versions of TensorRT.

Some other things that stand out that you may want to try if sticking with TensorRT 6.

  1. Try calling cudaSetDevice(idx) within each thread to assign one GPU per builder. Having multiple builders per GPU will cause issues with the kernel/tactic selection, and probably lead to bad performance. This likely isn’t the cause of the segfaults, but it is important to know and use, even in TensorRT 7.

  2. Try using methods/flags on builder only, and not builderConfig. I personally encountered some issues with builderConfig periodically in TensorRT 6, but builderConfig has worked seamlessly for me in TensorRT 7.

However, I strongly recommend keeping up to date with the latest version of TensorRT to capture bug fixes.
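To illustrate suggestion 1: the CUDA current device is per-host-thread state, so the device has to be selected inside each worker thread rather than once in main. Below is a minimal sketch of that shape. It assumes a stand-in `fakeSetDevice` in place of the real `cudaSetDevice(idx)` so that it compiles without CUDA, and `buildOnDevice` / `buildOnePerDevice` are hypothetical names, not TensorRT or CUDA API.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Stand-in for the CUDA runtime's per-thread "current device" (assumption:
// real code would call cudaSetDevice(idx) from <cuda_runtime_api.h> instead).
thread_local int t_currentDevice = -1;

void fakeSetDevice(int idx) { t_currentDevice = idx; }

// Hypothetical build step: binds the calling thread to a device, then reports
// which device the (placeholder) build ran on.
int buildOnDevice(int idx) {
  fakeSetDevice(idx);      // must happen inside the worker thread
  return t_currentDevice;  // a real version would build the engine here
}

// Starts one worker thread per device index and returns the device each used.
std::vector<int> buildOnePerDevice(int numDevices) {
  assert(numDevices >= 0);
  std::vector<int> usedDevice(numDevices, -1);
  std::vector<std::thread> threads;
  for (int idx = 0; idx < numDevices; ++idx) {
    // Each thread writes only its own slot, so no extra locking is needed.
    threads.emplace_back(
        [&usedDevice, idx] { usedDevice[idx] = buildOnDevice(idx); });
  }
  for (auto &t : threads) {
    t.join();
  }
  return usedDevice;
}
```

Calling `buildOnePerDevice(2)` would correspond to the two-GPU case discussed in this thread; in real code the return value of `cudaSetDevice` should also be checked.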

Thanks for the replies.

Note that this example is using a single global logger. The logger itself has a no-op log function, so is obviously thread-safe.
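For anyone whose logger does real work, one way to meet the thread-safety requirement quoted from the docs is to guard the sink with a mutex. The following is only a sketch: `LogSink` is a hypothetical stand-in for `nvinfer1::ILogger` so that the snippet compiles without TensorRT, and `LockedLogger`, `lineCount` and `logFromThreads` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for nvinfer1::ILogger (assumption: real code would
// derive from ILogger and override its log method instead).
struct LogSink {
  virtual ~LogSink() = default;
  virtual void log(int severity, const char *msg) = 0;
};

// Serializes every log call so concurrent builders can share one instance.
class LockedLogger : public LogSink {
public:
  void log(int severity, const char *msg) override {
    std::lock_guard<std::mutex> lock(mutex_);
    lines_.push_back(std::to_string(severity) + ": " + msg);
  }

  std::size_t lineCount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return lines_.size();
  }

private:
  std::mutex mutex_;
  std::vector<std::string> lines_;
};

// Logs to one shared logger from several threads; returns lines recorded.
std::size_t logFromThreads(int numThreads, int messagesPerThread) {
  assert(numThreads >= 0 && messagesPerThread >= 0);
  LockedLogger logger;
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([&logger, messagesPerThread] {
      for (int i = 0; i < messagesPerThread; ++i) {
        logger.log(0, "message");
      }
    });
  }
  for (auto &th : threads) {
    th.join();
  }
  return logger.lineCount();
}
```

A no-op `log`, as in the reproducing example, needs none of this because it touches no shared state.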

do you mind sharing your motivation to stay with TensorRT 6 and not move to TensorRT 7? This might help me to get a better answer than to just suggest upgrading.

It’s purely a business inertia problem: it takes time to upgrade our developers/servers/clients. We evaluated TensorRT 7 and, based on the release notes and our own performance measurements, we didn’t see a strong reason to move to it. Bugfixes (which are probably not all advertised in the release notes) would be a strong reason for us to upgrade.

Thanks for your subsequent notes.

Try calling cudaSetDevice(idx)

Yep. The original example was trying to work on multiple GPUs, but this was a smaller example which still crashed, so we posted that.

Try using methods/flags on builder only, and not builderConfig. I personally encountered some issues with builderConfig periodically in TensorRT 6

Would you be able to share any of those issues? Indeed, we only used the builderConfig because certain ops on the builder became deprecated.

I strongly recommend keeping up to date with the latest version of TensorRT to capture bug fixes.

That’s useful advice and we’ll keep it in mind!

The logger itself has a no-op log function, so is obviously thread-safe.

Ah, right, I completely glossed over the implementation, my apologies.

It’s purely a business inertia problem

I see. Unfortunately, there are no plans to retroactively fix bugs in previous versions, so using the newest version is all I can recommend for now.

Would you be able to share any of those issues? Indeed, we only used the builderConfig because certain ops on the builder became deprecated.

This only came up in certain workflows with INT8 calibration. Specifically, I think it was when implementing the readCalibrationCache method: it would give an error about calling a virtual method even though I had clearly implemented it.

This error only came up when setting builderConfig->int8_calibrator = calibrator, and the calibration worked fine when instead using builder->int8_calibrator = calibrator.

It’s not directly related to your issue, but I just suspect there may have been some underlying issues.

Update: I can crash TensorRT 7 with this example as well, on Ubuntu 18.04 (full specs below). It segfaults roughly once every 1500 runs (each run takes ~30 sec, so about one crash every 12 hours). Unfortunately, when I load the core file into gdb I see “?? ()” all the way down … maybe something is smashing the stack? Out of curiosity, I’m going to repeat this experiment with a mutex around initialization, to perhaps rule out certain data races (not that there’s anything I can do about it!).

Linux distro and version: Ubuntu 18.04.4 LTS
GPU type: GeForce RTX 2080 Ti AND GeForce RTX 2070 SUPER (the test seems to pick the 2080, but I can’t be sure it always does)
Nvidia driver version: 440.64.00
CUDA version: 10.2
CUDNN version: 7.6.5
TensorRT version: 7.0.0.11

With a mutex, I ran 4329 times without a crash (on Ubuntu 18.04 and TRT 7), so I’m guessing this really is a race condition. If this didn’t work I would have been much more confused and upset!
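The mutex workaround amounts to something like the sketch below. It assumes the lock is taken at the top of the build function; `FakeEngine` and `buildOneEngine` are hypothetical placeholders for the TensorRT calls in `BuildEngine`, and the atomic counters exist only to show that at most one build is in flight at a time.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Placeholder for nvinfer1::ICudaEngine* (assumption, not TensorRT API).
struct FakeEngine {
  int id;
};

static std::mutex g_buildMutex;
static std::atomic<int> g_activeBuilds{0};
static std::atomic<int> g_maxActiveBuilds{0};

// Placeholder for the real parse + buildEngineWithConfig sequence. It records
// the highest number of builds that were ever running at the same time.
FakeEngine buildOneEngine(int id) {
  int active = ++g_activeBuilds;
  int prev = g_maxActiveBuilds.load();
  while (active > prev &&
         !g_maxActiveBuilds.compare_exchange_weak(prev, active)) {
  }
  std::this_thread::yield();
  --g_activeBuilds;
  return FakeEngine{id};
}

// Holds the mutex for the whole build, so builds run one at a time even when
// they are started from several threads.
FakeEngine buildEngineSerialized(int id) {
  std::lock_guard<std::mutex> lock(g_buildMutex);
  return buildOneEngine(id);
}

// Starts numThreads builders and returns the observed peak concurrency.
int maxConcurrentBuilds(int numThreads) {
  assert(numThreads >= 0);
  g_maxActiveBuilds = 0;
  std::vector<std::thread> threads;
  for (int ix = 0; ix < numThreads; ++ix) {
    threads.emplace_back([ix] { buildEngineSerialized(ix); });
  }
  for (auto &t : threads) {
    t.join();
  }
  return g_maxActiveBuilds.load();
}
```

This gives up any speedup from building in parallel, so it is a diagnostic and a stopgap rather than a fix for the underlying race.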

Hi Tom,

Can you share the modified script to run indefinitely (until crash) and the Makefile associated with it? I’ll see if I can get a good repro and raise this internally.

Hi there,

I built with clang++-9 -std=c++17 main.cpp -lnvinfer -lnvparsers -lpthread and looped with

#!/bin/bash

ct=1
while ./a.out;
do
        echo $ct
        ct=$((ct+1))
done

The model files are large and proprietary. Could we discuss over a private message how to share them?

Thanks again,
Tom

Hi Tom,

Thanks for sharing. A couple more questions if you don’t mind:

  1. In your tests, this segfault every ~1500 runs is with 2 threads building in parallel?
  2. Can you reproduce this behavior with just 1 builder, no parallelism?
  3. Regarding sharing your model - I would think this isn’t specific to the model itself, and maybe more with the underlying API. Could you try reproducing using a simpler dummy model? (Might speed up the rate of segfault as well with a smaller model)

These could potentially provide a lot of insight into the underlying issue here.

Also, just to clarify, your real use case isn’t building the engines repeatedly from the same model file, it’s that you build 1 engine per GPU in parallel every so often during testing/development, and you happen upon a seemingly random segfault while doing so every so often? Please correct me if I’m wrong, or further clarify on the use case if possible.

Thanks,
Ryan

Hi Ryan,

That’s correct. That failure rate is what I estimate on Ubuntu 18.04 running TRT 7 with 2 threads building in parallel. The failure rate was higher on Ubuntu 16.04 with TRT 6.

  2. Can you reproduce this behavior with just 1 builder, no parallelism?

I can give that a try. If that doesn’t work something more fundamental may be going on. I did try serializing two builders in parallel with a mutex at the start of BuildEngine and never hit an issue after looping for 2 days straight (on Ubuntu 18.04 and TRT 7 at least). We have also been loading this model serially (with different code) for a very long time and never hit an issue.

  3. Regarding sharing your model - I would think this isn’t specific to the model itself, and maybe more with the underlying API. Could you try reproducing using a simpler dummy model? (Might speed up the rate of segfault as well with a smaller model)

Yeah, I can maybe try some smaller models. But since the build times are so long for this one (30 sec or so), it might actually be easier to crash with the heavyweight model. Who knows.

Also, just to clarify, your real use case isn’t building the engines repeatedly from the same model file, it’s that you build 1 engine per GPU in parallel every so often during testing/development, and you happen upon a seemingly random segfault while doing so every so often? Please correct me if I’m wrong, or further clarify on the use case if possible.

I don’t exactly know the use case because it’s from a client of our library (which links your library). I believe what you said is correct: they want to build two engines on two separate GPUs in parallel. Crashing is unacceptable, even sporadically. Even if their (and our) code is somehow non-optimal, it should be logically correct and not crash. I pared down their example using our library to an example using only TensorRT (and the little bit of code I wrote above). The TensorRT docs claim thread safety (with some caveats), so that’s why I raised the issue here.

I’d also like to see how Windows does with this. If I discover anything interesting, I’ll write back.

Thanks again,
Tom

Thanks Tom, I look forward to seeing the results.

Hey Ryan,

I have some updates:

Can you reproduce this behavior with just 1 builder, no parallelism?

Nope, one builder was just fine (as was placing a mutex). I ran for several days straight with no problems.

Could you try reproducing using a simpler dummy model?

That was a good suggestion, thanks! I modified two of your TensorRT samples to build multiple engines concurrently (sampleMNIST and sampleFasterRCNN) and was able to crash them both (without running any inference). In neither sample does the crash happen on every run, and it is not deterministic.

I’ve been working on the following platform:

Linux distro and version: Ubuntu 18.04.4 LTS
GPU type: GeForce RTX 2080 Ti AND GeForce RTX 2070 SUPER (the test seems to pick the 2080, but I can’t be sure it always does)
Nvidia driver version: 440.64.00
CUDA version: 10.2
CUDNN version: 7.6.5
TensorRT version: 7.0.0.11

Here’s a backtrace from a modified sampleMNIST running 128 concurrent builders (interestingly this is similar to the backtrace I posted from our own network):

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:491
491     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
[Current thread is 1 (Thread 0x7fd9b27fc700 (LWP 6324))]
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:491
#1  0x00007fda1b80d478 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fda1b7fdb84 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fda1c5fc585 in nvinfer1::NetworkTensor::setDynamicRange(float, float) ()
   from /usr/lib/x86_64-linux-gnu/libnvinfer.so.7
#4  0x0000000000402c12 in SampleMNIST::constructNetwork(std::unique_ptr<nvcaffeparser1::ICaffeParser, samplesCommon::InferDeleter>&, std::unique_ptr<nvinfer1::INetworkDefinition, samplesCommon::InferDeleter>&) ()
#5  0x00000000004023ab in SampleMNIST::build() ()
#6  0x00000000004043c5 in main::$_0::operator()(int) const ()
#7  0x0000000000404381 in _ZSt13__invoke_implIvZ4mainE3$_0JiEET_St14__invoke_otherOT0_DpOT1_ ()
#8  0x00000000004042d2 in _ZSt8__invokeIZ4mainE3$_0JiEENSt15__invoke_resultIT_JDpT0_EE4typeEOS2_DpOS3_ ()
#9  0x0000000000404292 in std::thread::_Invoker<std::tuple<main::$_0(int)> >::_M_invoke<0ul, 1ul> ()
#10 0x0000000000404245 in std::thread::_Invoker<std::tuple<main::$_0(int)> >::operator()() ()
#11 0x00000000004040a9 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::$_0(int)> > >::_M_run() ()
#12 0x00007fda1b7a76df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007fda1ba7a6db in start_thread (arg=0x7fd9b27fc700) at pthread_create.c:463
#14 0x00007fda1ae6488f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Here’s a backtrace from a modified sampleFasterRCNN, which ran 8 concurrent builders and hit “double free or corruption (fasttop)”:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fbf5df0a700 (LWP 12859))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fbf8354c801 in __GI_abort () at abort.c:79
#2  0x00007fbf83595897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fbf836c2b9a "%s\n")
    at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fbf8359c90a in malloc_printerr (str=str@entry=0x7fbf836c4828 "double free or corruption (fasttop)")
    at malloc.c:5350
#4  0x00007fbf835a4004 in _int_free (have_lock=0, p=0x7fbeea1449b0, av=0x7fbf54000020) at malloc.c:4230
#5  __GI___libc_free (mem=0x7fbeea1449c0) at malloc.c:3124
#6  0x00007fbf5b0030ac in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fbf5af2a1d0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007fbf5b035093 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007fbf5aefd102 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007fbf5ae03229 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#11 0x00007fbf5af877e9 in cuMemFreeHost () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#12 0x00007fbf937445bd in ?? () from /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2
#13 0x00007fbf9372251c in ?? () from /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2
#14 0x00007fbf93755101 in cudaFreeHost () from /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2
#15 0x00007fbf93a03c2c in nvinfer1::plugin::RPROIPlugin::~RPROIPlugin() ()
   from /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7
#16 0x00007fbf93a03cef in nvinfer1::plugin::RPROIPlugin::destroy() ()
   from /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7
#17 0x00007fbf8448affe in ?? () from /usr/lib/x86_64-linux-gnu/libnvparsers.so.7
#18 0x00007fbf8449a57c in ?? () from /usr/lib/x86_64-linux-gnu/libnvparsers.so.7
#19 0x00000000004124f8 in void samplesCommon::InferDeleter::operator()<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::ICaffeParser*) const ()
#20 0x000000000040a050 in std::unique_ptr<nvcaffeparser1::ICaffeParser, samplesCommon::InferDeleter>::~unique_ptr() ()
#21 0x0000000000403a32 in SampleFasterRCNN::build() ()
#22 0x0000000000409b17 in main::$_2::operator()() const ()
#23 0x0000000000409a6d in _ZSt13__invoke_implIvZ4mainE3$_2JEET_St14__invoke_otherOT0_DpOT1_ ()
#24 0x00000000004099fd in _ZSt8__invokeIZ4mainE3$_2JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS2_DpOS3_ ()
#25 0x00000000004099d5 in std::thread::_Invoker<std::tuple<main::$_2> >::_M_invoke<0ul> ()
#26 0x00000000004099a5 in std::thread::_Invoker<std::tuple<main::$_2> >::operator()() ()
#27 0x0000000000409889 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::$_2> > >::_M_run() ()
#28 0x00007fbf83f706df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#29 0x00007fbf842436db in start_thread (arg=0x7fbf5df0a700) at pthread_create.c:463
#30 0x00007fbf8362d88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I’m not going to include my actual modifications unless you want me to, but here are the changes. First, I moved to a no-op logger just to remove that from consideration (as in the example at the beginning of the thread). Then I launched multiple sample.build() calls (on distinct sample instances) in concurrent threads (again, analogous to my example). No inference was performed. Then I looped the samples using the shell script shown earlier.
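In outline, the concurrent launch looks roughly like this. It is a sketch rather than the actual modification: `FakeSample` is a hypothetical stand-in for a sample class such as SampleMNIST, so the snippet compiles without TensorRT, and `buildConcurrently` is an illustrative helper name.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a sample class; the real build() would create a
// builder, parse the model and build an engine.
struct FakeSample {
  bool built = false;
  bool build() {
    built = true;
    return true;
  }
};

// Calls build() on numSamples distinct instances, one thread per instance,
// and returns how many builds reported success.
template <typename Sample>
int buildConcurrently(int numSamples) {
  assert(numSamples >= 0);
  std::vector<Sample> samples(numSamples);
  // char rather than bool: std::vector<bool> packs bits, so concurrent writes
  // to neighbouring elements would race.
  std::vector<char> ok(numSamples, 0);
  std::vector<std::thread> threads;
  threads.reserve(numSamples);
  for (int ix = 0; ix < numSamples; ++ix) {
    threads.emplace_back(
        [&samples, &ok, ix] { ok[ix] = samples[ix].build() ? 1 : 0; });
  }
  for (auto &t : threads) {
    t.join();
  }
  int succeeded = 0;
  for (char c : ok) {
    succeeded += c;
  }
  return succeeded;
}
```

With the real samples, each instance is constructed from its own params object, and the numbers of builders mentioned above (128 for sampleMNIST, 8 for sampleFasterRCNN) would be passed as `numSamples`.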

I hope this is actually the issue my company’s network was hitting. If not, we may be able to share the prototxt with you for exploration (I don’t think sharing weights is necessary, but I’m verifying now if I can crash without them).

Regards,
Tom

And yes, I can repro the crash on my company’s model using only the prototxt, no caffemodel (ICaffeParser doesn’t require a caffemodel). It crashes in a similar way when I pass a caffemodel. With 4 threads, I crashed after ~300 iterations. Sharing this prototxt (and code) could be another easy way to get you a repro (if you are interested and my supervisors agree).

Cheers,
Tom

Hey Tom,

Thanks for the detailed response.

Yes, please share your modified MNIST sample + a Makefile - this saves me some trouble and “context switching” when juggling several threads/issues, and makes it faster for me to file reproducible bugs as necessary.

Hey there, thanks for the response.

I don’t have any makefiles, as I was just hacking in the samples source. You can drop this in to your samples project and build it as it’s supposed to be built.

Here’s the hacked sampleMNIST.cpp:

/*
 * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

//! \file sampleMNIST.cpp
//! \brief This file contains the implementation of the MNIST sample.
//!
//! It builds a TensorRT engine by importing a trained MNIST Caffe model. It uses the engine to run
//! inference on an input image of a digit.
//! It can be run with the following command line:
//! Command: ./sample_mnist [-h or --help] [-d=/path/to/data/dir or --datadir=/path/to/data/dir]

/*
clang++ /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7
//usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2 sampleFasterRCNN.cpp -I ../../common/ -I
/home/tom/projects/TensorRT/include/ -I /usr/local/cuda-10.2/targets/x86_64-linux/include/ -lnvinfer -lnvparsers
-lpthread
*/

#include "argsParser.h"
#include "common.h"

#include "NvCaffeParser.h"
#include "NvInfer.h"

#include <algorithm>
#include <cassert>
#include <cmath>
//#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <thread>

const std::string gSampleName = "TensorRT.sample_mnist";

namespace
{
class TomLogger : public nvinfer1::ILogger
{
public:
    void log(nvinfer1::ILogger::Severity severity, const char* msg) override
    {
    }
} g_logger;
} // namespace

//!
//! \brief  The SampleMNIST class implements the MNIST sample
//!
//! \details It creates the network using a trained Caffe MNIST classification model
//!
class SampleMNIST
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;

public:
    SampleMNIST(const samplesCommon::CaffeSampleParams& params)
        : mParams(params)
    {
    }

    //!
    //! \brief Builds the network engine
    //!
    bool build();

    //!
    //! \brief Used to clean up any state created in the sample class
    //!
    bool teardown();

private:
    //!
    //! \brief uses a Caffe parser to create the MNIST Network and marks the
    //!        output layers
    //!
    void constructNetwork(
        SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network);

    std::shared_ptr<nvinfer1::ICudaEngine> mEngine{nullptr}; //!< The TensorRT engine used to run the network

    samplesCommon::CaffeSampleParams mParams; //!< The parameters for the sample.

    nvinfer1::Dims mInputDims; //!< The dimensions of the input to the network.

    SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob>
        mMeanBlob; //! the mean blob, which we need to keep around until build is done
};

//!
//! \brief Creates the network, configures the builder and creates the network engine
//!
//! \details This function creates the MNIST network by parsing the caffe model and builds
//!          the engine that will be used to run MNIST (mEngine)
//!
//! \return Returns true if the engine was created successfully and false otherwise
//!
bool SampleMNIST::build()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(g_logger));
    if (!builder)
    {
        return false;
    }

    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if (!network)
    {
        return false;
    }

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config)
    {
        return false;
    }

    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
    if (!parser)
    {
        return false;
    }

    constructNetwork(parser, network);
    builder->setMaxBatchSize(mParams.batchSize);
    config->setMaxWorkspaceSize(16_MiB);
    config->setFlag(BuilderFlag::kGPU_FALLBACK);
    config->setFlag(BuilderFlag::kSTRICT_TYPES);
    if (mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if (mParams.int8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }

    samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);

    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
        builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());

    if (!mEngine)
        return false;

    assert(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 3);

    return true;
}

//!
//! \brief Uses a caffe parser to create the MNIST Network and marks the
//!        output layers
//!
//! \param parser Pointer to the Caffe parser used to read the model
//!
//! \param network Pointer to the network that will be populated with the MNIST network
//!
void SampleMNIST::constructNetwork(
    SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
{
    const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
        mParams.prototxtFileName.c_str(), mParams.weightsFileName.c_str(), *network, nvinfer1::DataType::kFLOAT);

    for (auto& s : mParams.outputTensorNames)
    {
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    }

    // add mean subtraction to the beginning of the network
    nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
    mMeanBlob
        = SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob>(parser->parseBinaryProto(mParams.meanFileName.c_str()));
    nvinfer1::Weights meanWeights{nvinfer1::DataType::kFLOAT, mMeanBlob->getData(), inputDims.d[1] * inputDims.d[2]};
    // For this sample, a large range based on the mean data is chosen and applied to the head of the network.
    // After the mean subtraction occurs, the range is expected to be between -127 and 127, so the rest of the network
    // is given a generic range.
    // The preferred method is use scales computed based on a representative data set
    // and apply each one individually based on the tensor. The range here is large enough for the
    // network, but is chosen for example purposes only.
    float maxMean
        = samplesCommon::getMaxValue(static_cast<const float*>(meanWeights.values), samplesCommon::volume(inputDims));

    auto mean = network->addConstant(nvinfer1::Dims3(1, inputDims.d[1], inputDims.d[2]), meanWeights);
    mean->getOutput(0)->setDynamicRange(-maxMean, maxMean);
    network->getInput(0)->setDynamicRange(-maxMean, maxMean);
    auto meanSub = network->addElementWise(*network->getInput(0), *mean->getOutput(0), ElementWiseOperation::kSUB);
    meanSub->getOutput(0)->setDynamicRange(-maxMean, maxMean);
    network->getLayer(0)->setInput(0, *meanSub->getOutput(0));
    samplesCommon::setAllTensorScales(network.get(), 127.0f, 127.0f);
}

//!
//! \brief Used to clean up any state created in the sample class
//!
bool SampleMNIST::teardown()
{
    //! Clean up the libprotobuf files as the parsing is complete
    //! \note It is not safe to use any other part of the protocol buffers library after
    //! ShutdownProtobufLibrary() has been called.
    nvcaffeparser1::shutdownProtobufLibrary();
    return true;
}

//!
//! \brief Initializes members of the params struct using the command line args
//!
samplesCommon::CaffeSampleParams initializeSampleParams(const samplesCommon::Args& args)
{
    samplesCommon::CaffeSampleParams params;
    if (args.dataDirs.empty()) //!< Use default directories if user hasn't provided directory paths
    {
        params.dataDirs.push_back("data/mnist/");
        params.dataDirs.push_back("data/samples/mnist/");
    }
    else //!< Use the data directory provided by the user
    {
        params.dataDirs = args.dataDirs;
    }

    params.prototxtFileName = locateFile("mnist.prototxt", params.dataDirs);
    params.weightsFileName = locateFile("mnist.caffemodel", params.dataDirs);
    params.meanFileName = locateFile("mnist_mean.binaryproto", params.dataDirs);
    params.inputTensorNames.push_back("data");
    params.batchSize = 1;
    params.outputTensorNames.push_back("prob");
    params.dlaCore = args.useDLACore;
    params.int8 = args.runInInt8;
    params.fp16 = args.runInFp16;

    return params;
}

//!
//! \brief Prints the help information for running this sample
//!
void printHelpInfo()
{
    std::cout
        << "Usage: ./sample_mnist [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]\n";
    std::cout << "--help          Display help information\n";
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "(data/samples/mnist/, data/mnist/)"
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "--int8          Run in Int8 mode.\n";
    std::cout << "--fp16          Run in FP16 mode.\n";
}

int main(int argc, char** argv)
{
    samplesCommon::Args args;
    bool argsOK = samplesCommon::parseArgs(args, argc, argv);
    if (!argsOK)
    {
        std::cerr << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (args.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }

    samplesCommon::CaffeSampleParams params = initializeSampleParams(args);

    auto buildSample = [params](int const id) {
        SampleMNIST sample(params);
        // std::cout << "Building and running a GPU inference engine for MNIST: " << id << std::endl;
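        // Call build() outside assert() so it still runs when NDEBUG is defined.
        bool const built = sample.build();
        assert(built);
        (void) built;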
    };

    int const numEngines = 128;
    std::vector<std::thread> initThreads;

    for (int ix = 0; ix < numEngines; ++ix)
    {
        initThreads.emplace_back(buildSample, ix);
    }

    for (auto& t : initThreads)
    {
        t.join();
    }

    return 0;
}

And here’s the hacked sampleFasterRCNN.cpp:

/*
 * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

//!
//! sampleFasterRCNN.cpp
//! This file contains the implementation of the FasterRCNN sample. It creates the network using
//! the FasterRCNN caffe model.
//! It can be run with the following command line:
//! Command: ./sample_fasterRCNN [-h or --help] [-d=/path/to/data/dir or --datadir=/path/to/data/dir]
//!

// clang++ /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7
// //usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2 sampleFasterRCNN.cpp -I ../../common/ -I
// /home/tom/projects/TensorRT/include/ -I /usr/local/cuda-10.2/targets/x86_64-linux/include/ -lnvinfer -lnvparsers
// -lpthread

#include "argsParser.h"
#include "buffers.h"
#include "common.h"
//#include "logger.h"

#include "NvCaffeParser.h"
#include "NvInfer.h"
#include <cuda_runtime_api.h>

#include <cassert>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <thread>
#include <vector>

namespace
{
class TomLogger : public nvinfer1::ILogger
{
public:
    void log(nvinfer1::ILogger::Severity severity, const char* msg) override {}
} g_logger;
} // namespace

const std::string gSampleName = "TensorRT.sample_fasterRCNN";

//!
//! \brief The SampleFasterRCNNParams structure groups the additional parameters required by
//!         the FasterRCNN sample.
//!
struct SampleFasterRCNNParams : public samplesCommon::CaffeSampleParams
{
    int outputClsSize; //!< The number of output classes
    int nmsMaxOut;     //!< The maximum number of detection post-NMS
};

//! \brief  The SampleFasterRCNN class implements the FasterRCNN sample
//!
//! \details It creates the network using a caffe model
//!
class SampleFasterRCNN
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;

public:
    SampleFasterRCNN(const SampleFasterRCNNParams& params)
        : mParams(params)
        , mEngine(nullptr)
    {
    }

    //!
    //! \brief Function builds the network engine
    //!
    bool build();

    //!
    //! \brief Runs the TensorRT inference engine for this sample
    //!
    bool infer();

    //!
    //! \brief Cleans up any state created in the sample class
    //!
    bool teardown();

private:
    SampleFasterRCNNParams mParams; //!< The parameters for the sample.

    nvinfer1::Dims mInputDims; //!< The dimensions of the input to the network.

    static const int kIMG_CHANNELS = 3;
    static const int kIMG_H = 375;
    static const int kIMG_W = 500;
    std::vector<samplesCommon::PPM<kIMG_CHANNELS, kIMG_H, kIMG_W>> mPPMs; //!< PPMs of test images

    std::shared_ptr<nvinfer1::ICudaEngine> mEngine; //!< The TensorRT engine used to run the network

    //!
    //! \brief Parses a Caffe model for FasterRCNN and creates a TensorRT network
    //!
    void constructNetwork(SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser,
        SampleUniquePtr<nvinfer1::IBuilder>& builder, SampleUniquePtr<nvinfer1::INetworkDefinition>& network,
        SampleUniquePtr<nvinfer1::IBuilderConfig>& config);

    //!
    //! \brief Reads the input and mean data, preprocesses, and stores the result in a managed buffer
    //!
    bool processInput(const samplesCommon::BufferManager& buffers);

    //!
    //! \brief Filters output detections, handles post-processing of bounding boxes and verify results
    //!
    bool verifyOutput(const samplesCommon::BufferManager& buffers);

    //!
    //! \brief Performs inverse bounding box transform and clipping
    //!
    void bboxTransformInvAndClip(const float* rois, const float* deltas, float* predBBoxes, const float* imInfo,
        const int N, const int nmsMaxOut, const int numCls);

    //!
    //! \brief Performs non maximum suppression on final bounding boxes
    //!
    std::vector<int> nonMaximumSuppression(std::vector<std::pair<float, int>>& scoreIndex, float* bbox,
        const int classNum, const int numClasses, const float nmsThreshold);
};

//!
//! \brief Creates the network, configures the builder and creates the network engine
//!
//! \details This function creates the FasterRCNN network by parsing the caffe model and builds
//!          the engine that will be used to run FasterRCNN (mEngine)
//!
//! \return Returns true if the engine was created successfully and false otherwise
//!
bool SampleFasterRCNN::build()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(g_logger));
    if (!builder)
    {
        return false;
    }

    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if (!network)
    {
        return false;
    }

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config)
    {
        return false;
    }

    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
    if (!parser)
    {
        return false;
    }
    constructNetwork(parser, builder, network, config);

    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
        builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
    if (!mEngine)
    {
        return false;
    }

    assert(network->getNbInputs() == 2);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 3);

    return true;
}

//!
//! \brief Uses a caffe parser to create the FasterRCNN network and marks the
//!        output layers
//!
//! \param network Pointer to the network that will be populated with the FasterRCNN network
//!
//! \param builder Pointer to the engine builder
//!
void SampleFasterRCNN::constructNetwork(SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser,
    SampleUniquePtr<nvinfer1::IBuilder>& builder, SampleUniquePtr<nvinfer1::INetworkDefinition>& network,
    SampleUniquePtr<nvinfer1::IBuilderConfig>& config)
{
    const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor
        = parser->parse(locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
            locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(), *network, nvinfer1::DataType::kFLOAT);

    for (auto& s : mParams.outputTensorNames)
    {
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    }

    builder->setMaxBatchSize(mParams.batchSize);
    config->setMaxWorkspaceSize(16_MiB);
    samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);
}

//!
//! \brief Runs the TensorRT inference engine for this sample
//!
//! \details This function is the main execution function of the sample. It allocates the buffer,
//!          sets inputs and executes the engine.
//!
bool SampleFasterRCNN::infer()
{
    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
    {
        return false;
    }

    // Read the input data into the managed buffers
    assert(mParams.inputTensorNames.size() == 2);
    if (!processInput(buffers))
    {
        return false;
    }

    // Memcpy from host input buffers to device input buffers
    buffers.copyInputToDevice();

    bool status = context->execute(mParams.batchSize, buffers.getDeviceBindings().data());
    if (!status)
    {
        return false;
    }

    // Memcpy from device output buffers to host output buffers
    buffers.copyOutputToHost();

    // Post-process detections and verify results
    if (!verifyOutput(buffers))
    {
        return false;
    }

    return true;
}

//!
//! \brief Cleans up any state created in the sample class
//!
bool SampleFasterRCNN::teardown()
{
    //! Clean up the libprotobuf files as the parsing is complete
    //! \note It is not safe to use any other part of the protocol buffers library after
    //! ShutdownProtobufLibrary() has been called.
    nvcaffeparser1::shutdownProtobufLibrary();
    return true;
}

//!
//! \brief Reads the input and mean data, preprocesses, and stores the result in a managed buffer
//!
bool SampleFasterRCNN::processInput(const samplesCommon::BufferManager& buffers)
{
    const int inputC = mInputDims.d[0];
    const int inputH = mInputDims.d[1];
    const int inputW = mInputDims.d[2];
    const int batchSize = mParams.batchSize;

    // Available images
    const std::vector<std::string> imageList = {"000456.ppm", "000542.ppm", "001150.ppm", "001763.ppm", "004545.ppm"};
    mPPMs.resize(batchSize);
    assert(mPPMs.size() <= imageList.size());

    // Fill im_info buffer
    float* hostImInfoBuffer = static_cast<float*>(buffers.getHostBuffer("im_info"));
    for (int i = 0; i < batchSize; ++i)
    {
        readPPMFile(locateFile(imageList[i], mParams.dataDirs), mPPMs[i]);
        hostImInfoBuffer[i * 3] = float(mPPMs[i].h);     // Number of rows
        hostImInfoBuffer[i * 3 + 1] = float(mPPMs[i].w); // Number of columns
        hostImInfoBuffer[i * 3 + 2] = 1;                 // Image scale
    }

    // Fill data buffer
    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("data"));
    // Pixel mean used by the Faster R-CNN's author
    const float pixelMean[3]{102.9801f, 115.9465f, 122.7717f}; // Also in BGR order
    for (int i = 0, volImg = inputC * inputH * inputW; i < batchSize; ++i)
    {
        for (int c = 0; c < inputC; ++c)
        {
            // The color image to input should be in BGR order
            for (unsigned j = 0, volChl = inputH * inputW; j < volChl; ++j)
                hostDataBuffer[i * volImg + c * volChl + j] = float(mPPMs[i].buffer[j * inputC + 2 - c]) - pixelMean[c];
        }
    }

    return true;
}

//!
//! \brief Filters output detections and handles post-processing of bounding boxes, verify result
//!
//! \return whether the detection output matches expectations
//!
bool SampleFasterRCNN::verifyOutput(const samplesCommon::BufferManager& buffers)
{
    const int batchSize = mParams.batchSize;
    const int nmsMaxOut = mParams.nmsMaxOut;
    const int outputClsSize = mParams.outputClsSize;
    const int outputBBoxSize = mParams.outputClsSize * 4;

    const float* imInfo = static_cast<const float*>(buffers.getHostBuffer("im_info"));
    const float* deltas = static_cast<const float*>(buffers.getHostBuffer("bbox_pred"));
    const float* clsProbs = static_cast<const float*>(buffers.getHostBuffer("cls_prob"));
    float* rois = static_cast<float*>(buffers.getHostBuffer("rois"));

    // Unscale back to raw image space
    for (int i = 0; i < batchSize; ++i)
    {
        for (int j = 0; j < nmsMaxOut * 4 && imInfo[i * 3 + 2] != 1; ++j)
        {
            rois[i * nmsMaxOut * 4 + j] /= imInfo[i * 3 + 2];
        }
    }

    std::vector<float> predBBoxes(batchSize * nmsMaxOut * outputBBoxSize, 0);
    bboxTransformInvAndClip(rois, deltas, predBBoxes.data(), imInfo, batchSize, nmsMaxOut, outputClsSize);

    const float nmsThreshold = 0.3f;
    const float score_threshold = 0.8f;
    const std::vector<std::string> classes{"background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
        "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa",
        "train", "tvmonitor"};

    // The sample passes if there is at least one detection for each item in the batch
    bool pass = true;

    for (int i = 0; i < batchSize; ++i)
    {
        float* bbox = predBBoxes.data() + i * nmsMaxOut * outputBBoxSize;
        const float* scores = clsProbs + i * nmsMaxOut * outputClsSize;
        int numDetections = 0;
        for (int c = 1; c < outputClsSize; ++c) // Skip the background
        {
            std::vector<std::pair<float, int>> scoreIndex;
            for (int r = 0; r < nmsMaxOut; ++r)
            {
                if (scores[r * outputClsSize + c] > score_threshold)
                {
                    scoreIndex.push_back(std::make_pair(scores[r * outputClsSize + c], r));
                    std::stable_sort(scoreIndex.begin(), scoreIndex.end(),
                        [](const std::pair<float, int>& pair1, const std::pair<float, int>& pair2) {
                            return pair1.first > pair2.first;
                        });
                }
            }

            // Apply NMS algorithm
            const std::vector<int> indices = nonMaximumSuppression(scoreIndex, bbox, c, outputClsSize, nmsThreshold);

            numDetections += static_cast<int>(indices.size());

            // Show results
            for (unsigned k = 0; k < indices.size(); ++k)
            {
                const int idx = indices[k];
                const std::string storeName
                    = classes[c] + "-" + std::to_string(scores[idx * outputClsSize + c]) + ".ppm";
                std::cout << "Detected " << classes[c] << " in " << mPPMs[i].fileName << " with confidence "
                          << scores[idx * outputClsSize + c] * 100.0f << "% "
                          << " (Result stored in " << storeName << ")." << std::endl;

                const samplesCommon::BBox b{bbox[idx * outputBBoxSize + c * 4], bbox[idx * outputBBoxSize + c * 4 + 1],
                    bbox[idx * outputBBoxSize + c * 4 + 2], bbox[idx * outputBBoxSize + c * 4 + 3]};
                writePPMFileWithBBox(storeName, mPPMs[i], b);
            }
        }
        pass &= numDetections >= 1;
    }
    return pass;
}

//!
//! \brief Performs inverse bounding box transform
//!
void SampleFasterRCNN::bboxTransformInvAndClip(const float* rois, const float* deltas, float* predBBoxes,
    const float* imInfo, const int N, const int nmsMaxOut, const int numCls)
{
    for (int i = 0; i < N * nmsMaxOut; ++i)
    {
        float width = rois[i * 4 + 2] - rois[i * 4] + 1;
        float height = rois[i * 4 + 3] - rois[i * 4 + 1] + 1;
        float ctr_x = rois[i * 4] + 0.5f * width;
        float ctr_y = rois[i * 4 + 1] + 0.5f * height;
        const float* imInfo_offset = imInfo + i / nmsMaxOut * 3;
        for (int j = 0; j < numCls; ++j)
        {
            float dx = deltas[i * numCls * 4 + j * 4];
            float dy = deltas[i * numCls * 4 + j * 4 + 1];
            float dw = deltas[i * numCls * 4 + j * 4 + 2];
            float dh = deltas[i * numCls * 4 + j * 4 + 3];
            float pred_ctr_x = dx * width + ctr_x;
            float pred_ctr_y = dy * height + ctr_y;
            float pred_w = exp(dw) * width;
            float pred_h = exp(dh) * height;
            predBBoxes[i * numCls * 4 + j * 4]
                = std::max(std::min(pred_ctr_x - 0.5f * pred_w, imInfo_offset[1] - 1.f), 0.f);
            predBBoxes[i * numCls * 4 + j * 4 + 1]
                = std::max(std::min(pred_ctr_y - 0.5f * pred_h, imInfo_offset[0] - 1.f), 0.f);
            predBBoxes[i * numCls * 4 + j * 4 + 2]
                = std::max(std::min(pred_ctr_x + 0.5f * pred_w, imInfo_offset[1] - 1.f), 0.f);
            predBBoxes[i * numCls * 4 + j * 4 + 3]
                = std::max(std::min(pred_ctr_y + 0.5f * pred_h, imInfo_offset[0] - 1.f), 0.f);
        }
    }
}

//!
//! \brief Performs non maximum suppression on final bounding boxes
//!
std::vector<int> SampleFasterRCNN::nonMaximumSuppression(std::vector<std::pair<float, int>>& scoreIndex, float* bbox,
    const int classNum, const int numClasses, const float nmsThreshold)
{
    auto overlap1D = [](float x1min, float x1max, float x2min, float x2max) -> float {
        if (x1min > x2min)
        {
            std::swap(x1min, x2min);
            std::swap(x1max, x2max);
        }
        return x1max < x2min ? 0 : std::min(x1max, x2max) - x2min;
    };

    auto computeIoU = [&overlap1D](float* bbox1, float* bbox2) -> float {
        float overlapX = overlap1D(bbox1[0], bbox1[2], bbox2[0], bbox2[2]);
        float overlapY = overlap1D(bbox1[1], bbox1[3], bbox2[1], bbox2[3]);
        float area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1]);
        float area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1]);
        float overlap2D = overlapX * overlapY;
        float u = area1 + area2 - overlap2D;
        return u == 0 ? 0 : overlap2D / u;
    };

    std::vector<int> indices;
    for (auto i : scoreIndex)
    {
        const int idx = i.second;
        bool keep = true;
        for (unsigned k = 0; k < indices.size(); ++k)
        {
            if (keep)
            {
                const int kept_idx = indices[k];
                float overlap = computeIoU(
                    &bbox[(idx * numClasses + classNum) * 4], &bbox[(kept_idx * numClasses + classNum) * 4]);
                keep = overlap <= nmsThreshold;
            }
            else
            {
                break;
            }
        }
        if (keep)
        {
            indices.push_back(idx);
        }
    }
    return indices;
}

//!
//! \brief Initializes members of the params struct using the command line args
//!
SampleFasterRCNNParams initializeSampleParams(const samplesCommon::Args& args)
{
    SampleFasterRCNNParams params;
    if (args.dataDirs.empty()) //!< Use default directories if user hasn't provided directory paths
    {
        params.dataDirs.push_back("data/faster-rcnn/");
        params.dataDirs.push_back("data/samples/faster-rcnn/");
    }
    else //!< Use the data directory provided by the user
    {
        params.dataDirs = args.dataDirs;
    }
    params.prototxtFileName = "faster_rcnn_test_iplugin.prototxt";
    params.weightsFileName = "VGG16_faster_rcnn_final.caffemodel";
    params.inputTensorNames.push_back("data");
    params.inputTensorNames.push_back("im_info");
    params.batchSize = 5;
    params.outputTensorNames.push_back("bbox_pred");
    params.outputTensorNames.push_back("cls_prob");
    params.outputTensorNames.push_back("rois");
    params.dlaCore = args.useDLACore;

    params.outputClsSize = 21;
    params.nmsMaxOut
        = 300; // This value needs to be changed as per the nmsMaxOut value set in RPROI plugin parameters in prototxt

    return params;
}

//!
//! \brief Prints the help information for running this sample
//!
void printHelpInfo()
{
    std::cout
        << "Usage: ./sample_fasterRCNN [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]"
        << std::endl;
    std::cout << "--help          Display help information" << std::endl;
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "data/samples/faster-rcnn/ and data/faster-rcnn/"
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
}

int main(int argc, char** argv)
{
    samplesCommon::Args args;
    bool argsOK = samplesCommon::parseArgs(args, argc, argv);
    if (!argsOK)
    {
        std::cerr << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (args.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }

    initLibNvInferPlugins(&g_logger, "");

    auto buildSample = [&]() -> void {
        SampleFasterRCNN sample(initializeSampleParams(args));

        std::cout << "Building and running a GPU inference engine for FasterRCNN" << std::endl;

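        // Call build() outside assert() so it still runs when NDEBUG is defined.
        bool const built = sample.build();
        assert(built);
        (void) built;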
    };

    int const numEngines = 8;
    std::vector<std::thread> initThreads;

    for (int ix = 0; ix < numEngines; ++ix)
    {
        initThreads.emplace_back(buildSample);
    }

    for (auto& t : initThreads)
    {
        t.join();
    }

    return EXIT_SUCCESS;
}

I discovered something that may be relevant. To my surprise, certain protobuf parsing functions that look like pure functions are actually not thread safe; see https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/text_format.cc#L213. This sort of thing is easy to detect with ThreadSanitizer. The (to-be-deprecated) NvCaffeParser definitely uses some of these functions (e.g., see https://github.com/NVIDIA/TensorRT/blob/07ed9b57b1ff7c24664388e5564b17f7ce2873e5/parsers/caffe/caffeParser/caffeParser.cpp#L320), but my example may not be hitting them. I could run a quick experiment that locks the parser itself to see whether we still hit these issues.

Cheers,
Tom

I ran that experiment (guarding parsing functions with a mutex), and it unfortunately does not solve this issue. The point about protobuf parsing thread safety may still be relevant to your parsers (including the ONNX parser).
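
For readers who want to repeat the experiment, guarding the parse step with a mutex can be sketched roughly as below. This is a minimal illustration, not the code that was actually run: `parseModel` is a hypothetical stand-in for the `parser->parse(...)` call (it is not a TensorRT API), and `g_parseMutex`, `guardedParse`, and `runGuardedParses` are illustrative names. Only parsing is serialized; anything after it, such as the engine build, would still run concurrently.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Illustrative name: a single lock shared by every thread that parses.
std::mutex g_parseMutex;

// Hypothetical stand-in for parser->parse(...); NOT a TensorRT API.
// It "succeeds" whenever a non-empty path is given.
bool parseModel(const std::string& path)
{
    return !path.empty();
}

// Runs the parse step while holding the lock, so at most one thread
// is inside the (possibly non-thread-safe) parsing code at a time.
bool guardedParse(const std::string& path)
{
    std::lock_guard<std::mutex> lock(g_parseMutex);
    return parseModel(path);
}

// Starts numThreads workers that each parse once and returns how many succeeded.
int runGuardedParses(int numThreads)
{
    // One slot per thread; char instead of bool because vector<bool>
    // packs bits and is not safe for concurrent writes to distinct elements.
    std::vector<char> results(numThreads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
    {
        threads.emplace_back([&results, i] { results[i] = guardedParse("models/model.proto") ? 1 : 0; });
    }
    for (auto& t : threads)
    {
        t.join();
    }
    int succeeded = 0;
    for (char r : results)
    {
        succeeded += r;
    }
    return succeeded;
}
```

In the real reproducer, the body of `guardedParse` would wrap the `parse` and `parseBinaryProto` calls; as noted above, doing so did not make the crash go away.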

Thanks for the updates, Tom. I’ve escalated the issue and will let you know when I hear back.

Linux distro and version: Ubuntu 16.04.6 LTS
GPU type: 2080Ti
Nvidia driver version: 430
CUDA version: 10.0
CUDNN version: 7.6.3
TensorRT version: 7.0.0.1
TensorFlow version: 2.2

We hit the same problem when we tried to initialize our models in parallel. We have multiple models, some TF-TRT and some pure TensorRT, and model loading and warmup are carried out during system initialization. Since there are many models, we hoped to reduce startup time by initializing them in parallel, but it now looks like this is not thread safe.

Is there any update on this issue?
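
In the meantime, one conservative workaround might be to serialize only the engine build and keep the rest of the per-model initialization parallel. The sketch below assumes that the build step is the only unsafe part, which this thread has not confirmed. `buildEngine` and `warmup` are hypothetical placeholders rather than TensorRT calls, and `g_buildMutex`, `initModel`, and `initAllModels` are illustrative names.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Illustrative name: allows only one engine build at a time.
std::mutex g_buildMutex;

// Hypothetical placeholder for building one model's engine; NOT a TensorRT API.
bool buildEngine(const std::string& modelName)
{
    return !modelName.empty();
}

// Hypothetical placeholder for per-model warmup; NOT a TensorRT API.
bool warmup(const std::string& modelName)
{
    return !modelName.empty();
}

// Builds under the lock, then warms up outside it so warmups can overlap.
bool initModel(const std::string& modelName)
{
    {
        std::lock_guard<std::mutex> lock(g_buildMutex);
        if (!buildEngine(modelName))
        {
            return false;
        }
    }
    return warmup(modelName);
}

// Initializes all models on separate threads and returns how many succeeded.
// Failures are reported through the futures instead of an assert inside the
// worker, so they are still visible when NDEBUG is defined.
int initAllModels(const std::vector<std::string>& modelNames)
{
    std::vector<std::future<bool>> pending;
    for (const auto& name : modelNames)
    {
        pending.push_back(std::async(std::launch::async, initModel, name));
    }
    int succeeded = 0;
    for (auto& f : pending)
    {
        if (f.get())
        {
            ++succeeded;
        }
    }
    return succeeded;
}
```

This gives up most of the startup speedup if the build dominates initialization time, so it is a stopgap rather than a fix.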