Hello, with reference to the following ticket, I am using two extra secondary preprocesses after the PGIE to provide embeddings to my model's non-image input layer at runtime.
This method works for me, but the FPS dropped by half: with static embeddings and without these extra preprocesses I was getting around 14-15 FPS, but after including the two preprocesses it drops to 7-8 FPS. Can you tell me why this is happening?
NOTE: I observed the same drop in FPS even with the standard preprocessing and no additional custom logic.
This folder contains the report generated with nsys for my DeepStream app. I tried to analyse it using the GUI tool but couldn't find anything of use, so could you please tell me where the bottleneck is?
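(A quicker cross-check than the GUI is the nsys CLI summary, which prints per-API and per-kernel time tables from the same report; assuming a standard Nsight Systems install, file and config names illustrative:)

```
nsys profile -t cuda,nvtx,osrt -o ds_report deepstream-app -c app_config.txt
nsys stats ds_report.nsys-rep
```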
In the second figure, at batch_num=1027, a large amount of time, up to 4 s, is consumed in cudaMemcpy, which is usually caused by waiting on synchronization.
You need to check why cudaMemcpy is blocking, and also optimize the custom tensor conversion function: 100+ ms is too long.
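One common pattern that removes this kind of blocking is to stage the data in pinned host memory and issue the copy with cudaMemcpyAsync on a dedicated stream, synchronizing once per batch instead of after every copy. A minimal sketch, with hypothetical buffer and function names (not the actual nvdspreprocess API):

```cpp
#include <cuda_runtime_api.h>

static void        *h_staging     = nullptr;  // pinned (page-locked) host buffer
static cudaStream_t copy_stream   = nullptr;
constexpr size_t    kPayloadBytes = 512 * sizeof(float);

void init_copy_resources()
{
    // cudaMemcpyAsync is only truly asynchronous from pinned memory;
    // from pageable memory it degrades to a blocking staged copy.
    cudaHostAlloc(&h_staging, kPayloadBytes, cudaHostAllocDefault);
    cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking);
}

// Per batch: enqueue the copy without blocking the calling thread.
void enqueue_upload(void *d_tensor)
{
    cudaMemcpyAsync(d_tensor, h_staging, kPayloadBytes,
                    cudaMemcpyHostToDevice, copy_stream);
}

// Synchronize once, right before the tensor is consumed by inference,
// rather than after every individual copy.
void finish_batch()
{
    cudaStreamSynchronize(copy_stream);
}
```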
Thank you very much for pointing this out, but can you also tell me how I can find which custom tensor is causing this? I am using four preprocess plugins to prepare custom tensors: one for my person detection model, one for the recognition model, and the other two for the swap model.
Can you also provide any guidance on this synchronisation issue? The cudaMemcpy calls I use only copy the 1x512-dimensional FP32 embeddings, and, in one of the preprocesses, the modified 128x128 frame.
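(One way to tell the four plugins apart on the nsys timeline is to wrap each plugin's conversion in a named NVTX range and re-profile with NVTX tracing enabled (`nsys profile -t cuda,nvtx ...`); the long cudaMemcpy then appears under a labelled range. A minimal sketch, range and function names hypothetical:)

```cpp
#include "nvToolsExt.h"   // <nvtx3/nvToolsExt.h> on newer CUDA; link -lnvToolsExt

void prepare_recognition_tensor(/* existing args */)
{
    // Shows up as a named bar on the nsys timeline, so a long
    // cudaMemcpy can be attributed to one specific plugin.
    nvtxRangePushA("preprocess:recognition-embedding");
    // ... existing conversion + cudaMemcpy ...
    nvtxRangePop();
}
```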
Thanks for your help, but I have a few more queries. As per your suggestion I am using two sources for input, which causes the FPS to drop from 30 to 15. If I use the non-image-layer preprocess on a single source only, the FPS is 22, which is not much of a drop, but then I face another problem: all faces are being swapped except every 12th image. I also put a counter in the non-image-layer preprocess and in the custom parser of the face swapper model, and observed the following:
if the non-image preprocess is enabled, the parser count is 924 and the preprocess count is 462
if the non-image preprocess is not enabled, the parser count is 462
Yes, we know this. The point we want to drive home is that with a single source the FPS drop is not significant, so we want to go with a single source. But with a single source we found that every 12th image generated by the swap model is black/white, and we want to know how to resolve this. We want to go with a single source only, as it seems it can work and give better performance, rather than with two sources, one per pre-process.
We put a static counter in the custom pre-process and in the custom parser to check how many times each is called. We found that with two pre-processes, one for the image layer and the other for the non-image layer, the custom parser is called twice. Why?
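(As an aside, a plain static int counter is not thread-safe across the pipeline's threads and can under-count; an atomic counter gives reliable numbers. A small sketch, function name hypothetical:)

```cpp
#include <atomic>
#include <cstdio>

static std::atomic<int> g_parser_calls{0};

// Call at the top of the custom parser; safe even if the pipeline
// invokes it from multiple threads.
void count_parser_call()
{
    fprintf(stderr, "parser call #%d\n", g_parser_calls.fetch_add(1) + 1);
}
```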
No, currently we are using nvinfer only. Will nvinferserver (the Triton Inference Server plugin, I guess) help in improving performance? And what other options are there to improve performance, for example here:
Currently in the pipeline, the YOLO and Human Attribute models are not in the purview of the tracker, since YOLO is working as an SGIE with a custom full-frame preprocess and the Human Attribute model operates on the YOLO output. Any proposal to improve this?
If you have multiple cameras, increasing the batch size will usually improve performance. If you only have one camera, it will not improve performance.
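For example, with two cameras both the muxer and the PGIE would batch across the sources (deepstream-app style configs, values illustrative):

```
# deepstream-app config
[streammux]
batch-size=2

# PGIE nvinfer config, [property] section -- must match, and the
# engine must be rebuilt for this max batch size
batch-size=2
```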
Parallel inference does not improve performance either. This is just to show how to run multiple models simultaneously in a pipeline.
Use CUDA for preprocessing, optimize the model (use INT8 inference or other optimizations), and use the above-mentioned Nsight to analyze performance bottlenecks.
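For instance, switching an nvinfer model to INT8 is a config change in the [property] section, provided a calibration cache exists for the model (file name illustrative):

```
[property]
network-mode=1              # 0=FP32, 1=INT8, 2=FP16
int8-calib-file=calib.table # INT8 calibration cache generated offline
```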