NCHWToNCHHW2 error while running a model with two output blobs

I have a modified version of GoogleNet, which has two output blobs instead of one.
I have been trying to profile this network using the dusty-nv/jetson-inference repository (Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson).

For benchmarking I used the ‘trt-bench’ executable. I was able to get a speed of 5 ms for the original GoogleNet provided with the repo.
My network also runs at 5 ms if I specify only one of my output blob names.

When I pass a vector of output blob names, the code compiles correctly, since multiple outputs are supported.
But I get this error:

[TRT]  reformat.cu (1036) - Cuda Error in NCHWToNCHHW2: 4
[TRT]  reformat.cu (1036) - Cuda Error in NCHWToNCHHW2: 4
[cuda]   cudaStreamSynchronize(stream)
[cuda]      unspecified launch failure (error 4) (hex 0x04)
[cuda]      /app/jetson-inference/imageNet.cpp:319
[TRT]  imageNet::Process() -- failed to enqueue TensorRT network
GPU network failed to process

Can someone guide me on what I am doing wrong? Any insights would be extremely helpful.

Thank you.

Hi,

You should modify the output buffer for your customized design.
Check this code first:
https://github.com/dusty-nv/jetson-inference/blob/master/tensorNet.cpp#L326

Thanks.