Running 2 or 3 DNNs in parallel: effect on inference time

I have a pipeline that runs inference with DriveNet and feeds the resulting car detections to two DNNs (binary classifiers that label a detection as Volkswagen or car). I have modified the for loop in the original DriveWorks sample (the one that loops over each detection of each class).
I then use std::async to create multiple threads, each of which runs inference with one classifier.
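
One host-side detail worth ruling out here: std::async called without an explicit launch policy is allowed to defer the task and run it lazily on the thread that later calls get() or wait(), which would serialize the classifiers before the GPU is even involved. The sketch below is only an illustration: runClassifier is a hypothetical stand-in for a classifier call, and launchClassifiers is a made-up helper. It passes std::launch::async so that every task runs on its own thread.

```cpp
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for one classifier call; it only reports
// which thread executed it.
std::thread::id runClassifier(int /*classifierIndex*/)
{
    return std::this_thread::get_id();
}

// Launch one task per classifier with an explicit std::launch::async policy
// and collect the id of the thread each task ran on.
std::vector<std::thread::id> launchClassifiers(int count)
{
    std::vector<std::future<std::thread::id>> futures;
    for (int i = 0; i < count; ++i) {
        futures.push_back(std::async(std::launch::async, runClassifier, i));
    }

    std::vector<std::thread::id> ids;
    for (auto& f : futures) {
        ids.push_back(f.get()); // blocks until that task has finished
    }
    return ids;
}
```

With std::launch::async none of the returned ids equals the id of the calling thread. With the default policy that is not guaranteed.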

My goal is to scale horizontally (add more classifiers) so as to reduce the overall processing time.

My issue is that with a single classifier the inference time is 2 ms, but with 2 or 3 classifiers it becomes 4 ms and 6 ms respectively, per classifier.
It looks like the classifiers are running sequentially.
How do I keep the inference time the same?
Below is my code for calling dnn1; I have a similar function for dnn2.
Calling command:

void Classifier_1(dwObjectDataCamera objCameraData)
{
    uint32_t x = objCameraData.box2D.x;
    uint32_t y = objCameraData.box2D.y;
    uint32_t width = objCameraData.box2D.width;
    uint32_t height = objCameraData.box2D.height;

    // ROI for the data conditioner (assumed to be the detection box)
    dwRect roi;
    roi.x = static_cast<int32_t>(x);
    roi.y = static_cast<int32_t>(y);
    roi.width = static_cast<int32_t>(width);
    roi.height = static_cast<int32_t>(height);

    std::lock_guard<std::mutex> mtx1_lock(s_Mutex1);
    auto starti = std::chrono::high_resolution_clock::now();
    dwImageCUDA* rgbaImage = nullptr;
    dwImage_getCUDA(&rgbaImage, m_imageRGBA);
    CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInputDevice1, &rgbaImage, 1, &roi,
                                                 cudaAddressModeClamp, m_dataConditioner1));
    CHECK_DW_ERROR(dwDNN_infer(&m_dnnOutputsDevice1, &m_dnnInputDevice1, m_dnn1));
    CHECK_CUDA_ERROR(cudaMemcpy(m_dnnOutputsHost1.get(), m_dnnOutputsDevice1,
                                sizeof(float32_t) * m_totalSizesOutput, cudaMemcpyDeviceToHost));
    auto endi = std::chrono::high_resolution_clock::now();

    // Compare the two output scores, not their addresses
    if (m_dnnOutputsHost1.get()[1] > m_dnnOutputsHost1.get()[0])
    {
        m_dnnLabelList[0].push_back("Car"); // class info is pushed in list
    }
}
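
Note that the std::lock_guard in Classifier_1 is held for the whole body, so any two calls that lock the same mutex cannot overlap, whatever the GPU does. Here is a minimal sketch of that effect, assuming several callers share one mutex. The lambda guardedWork is a hypothetical stand-in for the inference call, and the function counts how many callers are inside the guarded region at the same time.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Runs `tasks` concurrent callers that all lock the same mutex for the
// duration of their (simulated) work, and returns the largest number of
// callers that were ever inside the guarded region together.
int maxConcurrentUnderSharedLock(int tasks)
{
    std::mutex sharedMutex;
    std::atomic<int> inside{0};
    std::atomic<int> maxInside{0};

    auto guardedWork = [&]() {
        std::lock_guard<std::mutex> lock(sharedMutex);
        const int now = ++inside;
        // Record the high-water mark of simultaneous callers.
        int prev = maxInside.load();
        while (prev < now && !maxInside.compare_exchange_weak(prev, now)) {
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(5)); // simulated inference
        --inside;
    };

    std::vector<std::future<void>> futures;
    for (int i = 0; i < tasks; ++i) {
        futures.push_back(std::async(std::launch::async, guardedWork));
    }
    for (auto& f : futures) {
        f.get();
    }
    return maxInside.load();
}
```

The function returns 1 for any number of tasks, because the shared lock admits one caller at a time. If each classifier has its own mutex and its own buffers, the guarded regions of different classifiers are free to overlap on the host side.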

Dear @a.a.menezes,
Could you check if all the classifiers are launched in the same CUDA Stream?

Yes, there was only one CUDA stream, the default one (0). I created two new streams as follows:
cudaStream_t c1;
cudaStream_t c2;


Initializing the classifier with the newly created streams:
CHECK_DW_ERROR(dwDNN_setCUDAStream(c1, m_dnn1));
CHECK_DW_ERROR(dwDataConditioner_initialize(&m_dataConditioner1, &m_networkInputDimensions,
&metadata.dataConditionerParams, c1,

Dear @a.a.menezes,
It is difficult to comment with the given information.
In general, two GPU operations can run in parallel provided there are enough resources (such as compute capacity) on the GPU, both are launched from different CUDA streams, and both are asynchronous in nature.

cudaMemcpy(m_dnnOutputsHost1.get(), m_dnnOutputsDevice1,
sizeof(float32_t) * m_totalSizesOutput, cudaMemcpyDeviceToHost)

The cudaMemcpy operation runs in the default stream and hence cannot overlap with operations in other streams. You need to create non-blocking CUDA streams and use cudaMemcpyAsync data transfer calls. Note that cudaMemcpyAsync() needs pinned (page-locked) memory on the host to actually overlap with other work. Check to understand CUDA streams.

Try creating non-blocking CUDA streams and using cudaMemcpyAsync(), and see if it helps.

You mentioned using pinned memory before calling cudaMemcpyAsync. As I understand it, pinned memory optimizes copying data from host to device.
I need to create a pinned RGBA image and copy it to the classifier.
Is the code below correct with respect to the DriveWorks APIs?

dwImageCUDA* rgbaImage = nullptr; // A pointer to the dwImageCUDA pointer (pageable)
dwImage_getCUDA(&rgbaImage, m_imageRGBA); // m_imageRGBA is a dwImageHandle having a rgba image (pageable)

dwImageCUDA* pinned_rgbaImage = nullptr;
unsigned int bytes = sizeof(*rgbaImage); // Calculate size of image
cudaMallocHost((void**)&pinned_rgbaImage, bytes); // host pinned

dwImage_getCUDA(&pinned_rgbaImage, m_imageRGBA); // Image info copied to pinned memory

CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInputDevice[0], &pinned_rgbaImage, 1, &roi,
                                             cudaAddressModeClamp, m_dataConditioner[0])); // Having behavior as cudaMemcpy (HtoD)


Below are two images:
1) One classifier, with an inference time of 2 ms.

2) Four classifiers running on different streams in parallel, with an inference time of 6 ms instead of 2 ms.
Why are there pauses between the kernels of each classifier?

Dear @a.a.menezes,
Could you please attach both nvprof files?

Thanks @SivaRamaKrishnaNV for the reply. I cannot upload nvvp files using the upload option. It accepts only jpg, jpeg, png, gif and log files.


Any update on the above issue?

Dear @a.a.menezes,
I could see that compute kernels from different threads are overlapping.

Why are there pauses in between kernels of each classifier?

Because kernels from different threads overlap, you see gaps in the kernel execution timeline of any single thread.

I ran two instances of driveNet in parallel and observed the following:
Not all the kernels of the network can run in parallel even though they are on different streams.
The benefit is only for those kernels that can run in parallel.
Does distributing them across two different devices (Tegra A, Tegra B) help?

Dear a.a.menezes,
A GPU can run two CUDA kernels in parallel if there are enough resources.
Does distributing them on 2 different devices help

Yes. Note that the two Tegras are like separate systems. You can run two different applications on the two Tegras. If they need any synchronization or sharing of data, it has to be transferred via the network.

Thanks for the reply. Can you provide any example code on how to run two different applications on two Tegras?

Dear @a.a.menezes,
There is no sample application that makes use of both Tegras. As I said, they are like separate systems. You need to run your application separately on each Tegra and share data via the network.
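
For the data-sharing part, one possible approach (an assumption, not something DriveWorks provides) is to pack each result into a fixed byte layout and send it over an ordinary TCP or UDP socket between the two Tegras. The sketch below shows only the packing and unpacking; ClassifierResult, its fields, and the 12-byte layout are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical result record exchanged between the two applications.
struct ClassifierResult {
    std::uint32_t frameId;
    std::uint32_t classId;
    float score;
};

constexpr std::size_t kPacketSize = 12; // three 32-bit fields
using Packet = std::array<std::uint8_t, kPacketSize>;

// Write a 32-bit value in little-endian byte order.
static void putU32(std::uint8_t* dst, std::uint32_t v)
{
    for (int i = 0; i < 4; ++i) {
        dst[i] = static_cast<std::uint8_t>(v >> (8 * i));
    }
}

// Read a 32-bit little-endian value.
static std::uint32_t getU32(const std::uint8_t* src)
{
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(src[i]) << (8 * i);
    }
    return v;
}

// Serialize a result into the fixed 12-byte layout.
Packet pack(const ClassifierResult& r)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    Packet p{};
    std::uint32_t scoreBits = 0;
    std::memcpy(&scoreBits, &r.score, sizeof(scoreBits)); // copy the float's bit pattern
    putU32(p.data(), r.frameId);
    putU32(p.data() + 4, r.classId);
    putU32(p.data() + 8, scoreBits);
    return p;
}

// Rebuild a result from the fixed 12-byte layout.
ClassifierResult unpack(const Packet& p)
{
    ClassifierResult r{};
    r.frameId = getU32(p.data());
    r.classId = getU32(p.data() + 4);
    const std::uint32_t scoreBits = getU32(p.data() + 8);
    std::memcpy(&r.score, &scoreBits, sizeof(r.score));
    return r;
}
```

The sending application would write the packet bytes to its socket and the receiving one would read kPacketSize bytes and call unpack. A fixed layout keeps the two applications independent of each other's struct padding.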