nvinfer1::ICudaEngine::createExecutionContext returns nullptr!

I’m attempting to run multiple inferences in parallel using streams but I’m running into an interesting error.

My understanding is that you need one optimization profile per stream, which is fine, but when I attempt to use 3 streams, my code segfaults.

I’m using TensorRT 7.0.0 on Ubuntu 18.04 with a GTX 1650 and CUDA 10.2.

This is my code:

for (std::size_t idx = 0; idx < streams.size(); ++idx) {
  auto& context = contexts[idx];

  context = make_unique(engine.createExecutionContext());

  BOOST_ASSERT(context);
  BOOST_ASSERT(idx >= 0);
  BOOST_ASSERT(idx < context->getEngine().getNbOptimizationProfiles());

  std::cout << "attempting to set the optimization profile...: " << idx << " of "
            << context->getEngine().getNbOptimizationProfiles() << "\n";

  context->setOptimizationProfile(idx);

  std::cout << "done!\n";

  std::cout << "setting binding dimensions for the binding: "
            << context->getEngine().getBindingName(2 * static_cast<int>(idx)) << "\n";

  context->setBindingDimensions(2 * static_cast<int>(idx), nvinfer1::Dims4(batch_size, 3, 120, 120));
}
The BOOST_ASSERT(context) check is what fails, and I’m not sure why. There doesn’t seem to be any info in the API docs about this.

Ha, my ILogger has a “quiet” mode which I’d forgotten to turn off.
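
For context, the logger is roughly this (simplified sketch; the severity threshold is what I’m calling the “quiet” mode):

#include <NvInfer.h>
#include <iostream>

class logger final : public nvinfer1::ILogger {
public:
  explicit logger(bool quiet = false) : quiet_{quiet} {}

  // TensorRT 7.x signature (no noexcept yet)
  void log(Severity severity, char const* msg) override
  {
    // in "quiet" mode, drop anything below a warning; errors always get through
    if (quiet_ && severity > Severity::kWARNING) { return; }
    std::cerr << msg << "\n";
  }

private:
  bool quiet_;
};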

Turns out, I’m hitting an OOM error:

../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
FAILED_ALLOCATION: std::exception

Which is interesting because I thought I was setting up the memory correctly.

All I should have to do is invoke:

config->setMaxWorkspaceSize(1 << 30);

and that’ll make it Just Work, right?
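
For reference, the engine build looks roughly like this (simplified sketch; builder, network, num_streams, max_batch_size, and the "input" name are placeholders, and make_unique is the same helper as in the snippet above):

auto config = make_unique(builder->createBuilderConfig());
config->setMaxWorkspaceSize(1 << 30);   // 1 GiB of scratch space for tactic selection

// one optimization profile per stream, all covering the same shape range
for (std::size_t idx = 0; idx < num_streams; ++idx) {
  auto* profile = builder->createOptimizationProfile();

  profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN,
                         nvinfer1::Dims4(1, 3, 120, 120));
  profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT,
                         nvinfer1::Dims4(batch_size, 3, 120, 120));
  profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX,
                         nvinfer1::Dims4(max_batch_size, 3, 120, 120));

  config->addOptimizationProfile(profile);
}

auto engine = make_unique(builder->buildEngineWithConfig(*network, *config));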

Hi,

Usually a workspace-size related error would output a warning that your workspace isn’t big enough and that it may impact performance.

For an OOM error, this is usually because of the system’s total GPU memory. You can run watch -n 0.1 nvidia-smi in a separate shell while this code is executing to watch your GPU memory grow and see whether it actually runs out as expected.

For multi-threaded inference, I believe you need one execution context per thread for thread safety/correctness: https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#thread-safety

After that point, I think it will be a performance/memory tradeoff. If doing asynchronous inference (using streams), then you’ll likely prefer to use one stream per thread to get the parallel performance, but that will increase memory usage: https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#streaming

For additional performance, you might want one optimization profile per context to avoid switching the context’s profile at runtime, but again this will increase memory usage.

etc.
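
Roughly the pairing I have in mind (untested sketch, error handling omitted):

#include <NvInfer.h>
#include <cuda_runtime.h>
#include <vector>

// Bundle one context, one optimization profile, and one stream per "lane",
// so nothing needs to be switched while inference is in flight.
struct inference_lane {
  nvinfer1::IExecutionContext* context;  // owning: call destroy() when done
  int profile_index;                     // set once via setOptimizationProfile()
  cudaStream_t stream;
};

inline std::vector<inference_lane> make_lanes(nvinfer1::ICudaEngine& engine)
{
  std::vector<inference_lane> lanes;
  for (int p = 0; p < engine.getNbOptimizationProfiles(); ++p) {
    inference_lane lane{};
    lane.context = engine.createExecutionContext();
    lane.context->setOptimizationProfile(p);   // pin profile p to this context
    lane.profile_index = p;
    cudaStreamCreateWithFlags(&lane.stream, cudaStreamNonBlocking);
    lanes.push_back(lane);
  }
  return lanes;
}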

Hmm, I’m not using multiple threads. Instead, I’m creating streams like this:

auto make_stream() -> stream
{
  ::cudaStream_t stream_handle;

  auto ec = ::cudaStreamCreateWithFlags(&stream_handle, cudaStreamNonBlocking);
  if (ec != ::cudaError_t::cudaSuccess) {
    throw ec;
  }

  return {stream_handle, {}};   // hand the raw handle to the RAII wrapper ({} default-constructs the deleter)
}

where stream is:

namespace detail {
struct stream_deleter {
  using pointer = ::cudaStream_t;   // unique_ptr stores the stream handle itself (it's already a pointer type)

  auto operator()(pointer stream_handle) -> void
  {
    ::cudaStreamDestroy(stream_handle);
  }
};
}    // namespace detail

using stream = std::unique_ptr<::cudaStream_t, detail::stream_deleter>;

The main host thread simply launches the copies and the inference asynchronously on separate streams like this:

// stage this stream's input (binding 2 * batch_idx)
::cudaMemcpyAsync(bindings[2 * batch_idx], pinned_blob.get(), sizeof(float) * floats_per_batch,
                  ::cudaMemcpyHostToDevice, stream_handle);

// enqueue the inference asynchronously on the same stream
context->enqueueV2(bindings.data(), stream_handle, nullptr);

// read back this stream's output (binding 2 * batch_idx + 1)
::cudaMemcpyAsync(pinned_inference.get(), bindings[2 * batch_idx + 1], sizeof(float) * infer_out.size(),
                  ::cudaMemcpyDeviceToHost, stream_handle);

I then synchronize everything with a single call to cudaDeviceSynchronize().
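
Spelled out, the whole launch is roughly this (simplified sketch; pinned_blobs, pinned_results, and floats_per_result are placeholders for the per-stream buffers I actually use):

for (std::size_t batch_idx = 0; batch_idx < streams.size(); ++batch_idx) {
  auto  stream_handle = streams[batch_idx].get();
  auto& context       = contexts[batch_idx];

  // each stream gets its own pinned buffers and its own pair of bindings
  ::cudaMemcpyAsync(bindings[2 * batch_idx], pinned_blobs[batch_idx].get(),
                    sizeof(float) * floats_per_batch,
                    ::cudaMemcpyHostToDevice, stream_handle);

  context->enqueueV2(bindings.data(), stream_handle, nullptr);

  ::cudaMemcpyAsync(pinned_results[batch_idx].get(), bindings[2 * batch_idx + 1],
                    sizeof(float) * floats_per_result,
                    ::cudaMemcpyDeviceToHost, stream_handle);
}

::cudaDeviceSynchronize();   // wait for every stream to drain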

I’m not sure if this is correct or not.

I have taken care to make sure I have N profiles for N streams and that each profile is registered with an execution context. The number of bindings, for example, comes back as it should: 2 * the number of streams. In this case, the 2 comes from our net having only a single input and a single output layer (hence only 2 bindings per profile).
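
Concretely, the indexing I’m relying on is this (a sketch of how I understand the per-profile binding replication, with profile_idx being the profile/stream index):

// TensorRT replicates the bindings once per optimization profile, so with one
// input and one output per profile the indices work out to:
int const bindings_per_profile =
    engine.getNbBindings() / engine.getNbOptimizationProfiles();   // == 2 here

int const input_binding  = static_cast<int>(profile_idx) * bindings_per_profile;   // 2 * idx
int const output_binding = input_binding + 1;                                      // 2 * idx + 1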

Each context is assigned a unique profile too.

I tried using watch nvidia-smi and it didn’t seem like I should be running out of memory. I’m not sure if there’s anything else I should do to help debug this.

That looks and sounds reasonable to me. Perhaps you could try running the code under cuda-memcheck to better debug your issue: https://developer.nvidia.com/cuda-memcheck. I haven’t used it myself yet, but it seems like it could be helpful here.
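
If I’m reading the docs right, it should just be a matter of prefixing your binary, e.g. cuda-memcheck ./your_app.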

Also, just in case it wasn’t a typo, it might be worth adding the -n 0.1 to your watch nvidia-smi command, because otherwise the default value of -n is 2 seconds, which is probably way too slow to actually watch the OOM happen.

Heh, it was definitely just a truncation.

I played around a bit more. This seems like a genuine OOM error. For example, I’m able to run inferences on 4 streams with a batch size of 1 or 2 just fine, but when I up it to a batch size of 3 per stream, I get the OOM error.

What’s even more interesting is that when I do run the code through cuda-memcheck, it doesn’t fail at all. My guess is that the memchecker slows down program execution enough to keep the concurrent memory usage below the OOM limit.

I just wanted to make sure that I wasn’t doing anything egregiously wrong, but it seems like everything is working as expected; I just need to code for my hardware, which is more than doable.

Thank you for the help!