I noticed that deserializeCudaEngine and createExecutionContext each create 12 CUDA streams.
My application loads 12 DNN models, so I end up with 288 streams created by TensorRT alone (12 models x 24 streams each), plus a few more that my application creates itself. This causes trouble for Nsight, which sometimes fails to capture when so many streams are in use.
Does TensorRT really need this many streams? How can this be avoided?