NCHWToNCHHW2 error while running a model with two output blobs

I have a modified version of GoogleNet, which has two output blobs instead of one.
I have been trying to profile this network using link.

For benchmarking I used the ‘trt-bench’ executable. I was able to get an inference time of 5 ms for the original GoogleNet provided with the git repo.
I also get 5 ms for my network if only one of my output blob names is specified.

When I pass a vector of output blob names, the code compiles correctly, since it supports multiple outputs.
But at runtime I get this error:

[TRT] (1036) - Cuda Error in NCHWToNCHHW2: 4
[TRT] (1036) - Cuda Error in NCHWToNCHHW2: 4
[cuda]   cudaStreamSynchronize(stream)
[cuda]      unspecified launch failure (error 4) (hex 0x04)
[cuda]      /app/jetson-inference/imageNet.cpp:319
[TRT]  imageNet::Process() -- failed to enqueue TensorRT network
GPU network failed to process

Can someone guide me on what I am doing wrong? Any insights would be extremely helpful.

Thank you.


You need to modify the output buffer handling for your customized design: with two output blobs, the network has three bindings (one input, two outputs), and each output needs its own correctly sized device buffer. A missing or undersized buffer for the second output will cause the kernel to fault, which shows up as the CUDA error 4 (unspecified launch failure) you are seeing.
Check this code first: