I have a pipeline that runs inference with DriveNet and feeds the resulting car detections to two DNNs (binary classifiers that detect Volkswagen or car). I have modified the for loop in the original DriveWorks sample (looping over each detection of each class).
I am then using std::async to create multiple threads, each of which runs inference with one classifier.
My goal is to scale horizontally (more classifiers) so as to reduce the overall processing time.
My issue is that with a single classifier the inference time is 2 ms, but with 2 or 3 classifiers the inference time is 4 ms and 6 ms respectively per classifier.
It looks like the classifiers are running sequentially.
How do I keep the inference time the same?
Below is my code for calling just dnn1; I have a similar function for dnn2.
```cpp
void Classifier_1(dwObjectDataCamera objCameraData)
{
    dwImageCUDA* rgbaImage = nullptr;
    // ...
    CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInputDevice1, &rgbaImage, 1, &roi,
                                                 cudaAddressModeClamp, m_dataConditioner1));
    CHECK_DW_ERROR(dwDNN_infer(&m_dnnOutputsDevice1, &m_dnnInputDevice1, m_dnn1));
    CHECK_CUDA_ERROR(cudaMemcpy(m_dnnOutputsHost1, m_dnnOutputsDevice1, // host buffer name assumed
                                sizeof(float32_t) * m_totalSizesOutput, cudaMemcpyDeviceToHost));
    m_dnnLabelList.push_back("Car"); // class info is pushed onto the list
}
```
Could you check if all the classifiers are launched in the same CUDA Stream?
Yes, there was only one CUDA stream, the default one (0). I created two new streams and initialized each classifier with a newly created stream.
It is difficult to comment with the given information.
In general, two GPU operations can run in parallel provided there are enough resources (such as compute capacity) on the GPU, the operations are launched from different CUDA streams, and they are asynchronous in nature.
sizeof(float32_t) * m_totalSizesOutput, cudaMemcpyDeviceToHost)

A cudaMemcpy operation runs in the default stream and hence cannot overlap with operations in other streams. You need to create non-blocking CUDA streams and use cudaMemcpyAsync for the data transfer calls. Note that cudaMemcpyAsync() requires pinned memory on the host. Check http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf to understand CUDA streams.
Try creating non-blocking CUDA streams and using cudaMemcpyAsync(), and see if it helps.
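Putting those two points together, a minimal sketch for one classifier might look like the following. This is a fragment, not runnable as-is: the `m_*` member names are assumed from the snippets above, and `dwDNN_setCUDAStream` is the DriveWorks call that binds a network's work to a stream.

```cpp
// Sketch only: per-classifier non-blocking stream + async copy into pinned
// host memory, so copies from different classifiers can overlap.
cudaStream_t stream1;
CHECK_CUDA_ERROR(cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking));

// Host output buffer must be pinned for cudaMemcpyAsync to actually overlap.
float32_t* outputsHost1 = nullptr;
CHECK_CUDA_ERROR(cudaMallocHost((void**)&outputsHost1,
                                sizeof(float32_t) * m_totalSizesOutput));

// Issue this network's work on the non-blocking stream.
CHECK_DW_ERROR(dwDNN_setCUDAStream(stream1, m_dnn1));
CHECK_DW_ERROR(dwDNN_infer(&m_dnnOutputsDevice1, &m_dnnInputDevice1, m_dnn1));

CHECK_CUDA_ERROR(cudaMemcpyAsync(outputsHost1, m_dnnOutputsDevice1,
                                 sizeof(float32_t) * m_totalSizesOutput,
                                 cudaMemcpyDeviceToHost, stream1));
CHECK_CUDA_ERROR(cudaStreamSynchronize(stream1)); // before reading outputsHost1
```

Each classifier would get its own stream and pinned buffer; the synchronize is per-stream, so one classifier waiting on its copy does not serialize the others.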
You mentioned using pinned memory before performing cudaMemcpyAsync. As I understand it, using pinned memory optimizes copying data between host and device.
I need to create a pinned RGBA image and copy it to the classifier.
Is the code below correct with respect to the DriveWorks APIs?
```cpp
dwImageCUDA* rgbaImage = nullptr;                 // pointer to the dwImageCUDA (pageable)
dwImage_getCUDA(&rgbaImage, m_imageRGBA);         // m_imageRGBA is a dwImageHandle holding an RGBA image (pageable)
dwImageCUDA* pinned_rgbaImage = nullptr;
unsigned int bytes = sizeof(*rgbaImage);          // calculate size of image
cudaMallocHost((void**)&pinned_rgbaImage, bytes); // host pinned
dwImage_getCUDA(&pinned_rgbaImage, m_imageRGBA);  // image info copied to pinned memory
CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInputDevice, &pinned_rgbaImage, 1, &roi,
                                             cudaAddressModeClamp, m_dataConditioner)); // behaves like cudaMemcpy (HtoD)
```
Below are 2 images:
1) 1 classifier, with an inference time of 2 ms.
2) 4 classifiers running on different streams in parallel, with an inference time of 6 ms instead of 2 ms.
Why are there pauses between the kernels of each classifier?
Could you please attach both nvprof files?
Thanks @SivaRamaKrishnaNV for the reply. I cannot upload .nvvp files using the upload option; it accepts only jpg, jpeg, png, gif, and log.
Any update on the above issue?
I can see that compute kernels from different threads are overlapping.

Why are there pauses between the kernels of each classifier?

Because kernels from different threads overlap, you notice gaps in the kernel execution timeline of any single thread.
I ran two instances of DriveNet in parallel and observed the following:
Not all the kernels of the network can run in parallel, even though they are on different streams.
The benefit applies only to those kernels that can run in parallel.
Does distributing them on 2 different devices help (Tegra A, Tegra B)?

A GPU can run two CUDA kernels in parallel if there are enough resources.
Does distributing them on 2 different devices help

Yes. Note that the two Tegras are like separate systems. You can run two different applications on the two Tegras. If they need any synchronization or sharing of data, it has to be transferred via the network.
Thanks for the reply. Can you provide any example code on how to run two different applications on the two Tegras?
There is no sample application that makes use of both Tegras. As I said, they are like separate systems. You need to run your application separately on each Tegra and share data via the network.