PipeTuner is stuck on the same iteration forever

Hi,

I am using PipeTuner to optimize a YOLOv8 model. I made a ground-truth dataset and launched PipeTuner with my configuration. It works fine for a while and then stays on the same iteration forever, with the ETA increasing indefinitely. (For example, it got stuck on iteration 17/200 and I let it run for 24 hours, while a normal iteration takes only 1 or 2 minutes.) I am using the default "pysot" and "hyper" together. From my testing it looks like the optimizer algorithm is stuck in a loop, but I haven't found any way to check logs associated with the optimizer. My DeepStream logs are fine.
I have also run PipeTuner successfully on other datasets before.

Do you know if there is any way to check why PipeTuner freezes after a certain number of iterations?

Thank you,
Best regards,
Alexandre

Can you upload all the logs?

log_DsAppRun.log (3.7 KB)
eval_cmd2.txt (658 Bytes)
log_server_2024-12-10_14-35-12.txt (89 Bytes)
log_client_2024-12-10_14-35-12.txt (14.1 KB)

Please find attached the server and client logs as well as the logs from the last successful DS iteration. Here it got stuck after 8 iterations. I relaunched with exactly the same parameters, models and datasets, and it got stuck at iteration 43. I can see in eval_cmd2.txt that the metrics are all at 0.

Best regards,
Alexandre

Can you try running the DS pipeline as a standalone DS app, using the exact same params from the checkpoint folder where it got stuck? If PipeTuner got stuck, it could be that the pipeline itself hangs for some reason, so I would recommend first checking whether the pipeline executes fine outside PipeTuner.

If the standalone pipeline hangs, check the params used and adjust the param ranges in PipeTuner so that it won't sample such values for those params.
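
For example, from inside the same DS container you can point the reference deepstream-app at the config generated in the checkpoint folder (the path below is just a placeholder for wherever PipeTuner wrote the checkpoint on your system):

  # re-run the exact pipeline config standalone
  deepstream-app -c /path/to/checkpoint/dsAppConfig_0.txt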

Thanks for the suggestion. I just tried it and saw that DeepStream was indeed crashing.
Please find attached the logs of the DeepStream rerun with the PipeTuner parameters:
logs_rerun_Pipetuner.txt (2.3 KB)

I don’t know what is causing this error.

Best regards,
Alexandre

I found that in the tracker config created by PipeTuner, MaxShadowTrackingAge was too small, which was causing the errors in the logs above.
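
If it helps, the value ends up in the TargetManagement section of the generated low-level tracker config, something like this (the number here is just an illustration, not the value from my run):

  TargetManagement:
    maxShadowTrackingAge: 30   # max number of frames a lost target is kept in shadow tracking before termination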

Edit: I relaunched PipeTuner with a new configuration that prevents MaxShadowTrackingAge from being too small, but it got stuck again after 7 iterations.

However, when I now run a separate DeepStream instance with the configuration from that iteration, it goes through normally and I get "App run successful" at the end of the run.

Please find logs and config attached:

DsAppRun_output_20241213_063118.zip (5.2 KB)
log_deepstream_rerun.txt (2.9 KB)
log_server_2024-12-13_15-25-01.txt (89 Bytes)
log_client_2024-12-13_15-25-01.txt (10.9 KB)

Thanks for providing the logs. What does the generated output data look like? It is the input to the evaluation step, and we would need to check whether the results look okay.

Are you talking about the kitti_detector and MOT_Evaluate folders?

If yes, I get empty folders named after the media files, as well as a list of text files containing detections in this format:

classname 0.0 0 0.0 1795.500000 202.125000 1920.000000 513.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.451416
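
If I read this correctly against the standard KITTI label format, the fields are:

  class truncated occluded alpha  bbox_left bbox_top bbox_right bbox_bottom  height width length  x y z  rotation_y  score

so the line above is a "classname" detection with bounding box (1795.5, 202.1) to (1920.0, 513.0) and confidence ~0.45, with the 3D fields left at 0.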

The two folders don't seem to contain the same values.
For example, kitti_detector/00_000_000047.txt has one line (the one above) while MOT_evaluate/00_000_000047.txt is empty.

I guess this is normal.
Is there any way to check logs from the evaluation step?

Best regards,
Alexandre

Hi,

I did some digging and found that sometimes the DeepStream instance was not receiving all the EOS events from the video files.

I relaunched DeepStream with the last PipeTuner configuration and got this log:
log_DS_rerun.txt (5.0 KB)

But it seems to be random: I relaunched DeepStream 4 times and got this type of log only twice; the other two runs finished successfully. Any idea where this comes from?

Best regards,
Alexandre

Hi, @Alexandre-PAI

From the log, it seems there is something wrong with the tracker model engine. Can you check it?

Indeed, I was trying the ReID feature of NvDCF during this run. I tried both NVIDIA models, resnet50_market1501_aicity156.onnx and resnet50_market1501.etlt, keeping the standard parameters described in
config_tracker_NvDCF_accuracy_ResNet50_default.txt (3.4 KB)
I downloaded the models from NGC and adjusted the paths accordingly. I get the same error with both models.

However, I have seen similar results with my custom tracker configuration where I was not using the ReID part of the tracker (reidType: 0), like this one:
config_tracker_NvDCF_accuracy_custom.txt (3.3 KB)

In both cases, PipeTuner got stuck after some iterations and the DeepStream rerun showed similar behaviour (no EOS for some videos, at random).

Best regards,
Alexandre

This part is wrong for the ONNX model:

  tltModelKey: nvidia_tao
  tltEncodedModel: /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-fewshot-learning-app/models/mtmc/resnet50_market1501_aicity156.onnx
  modelEngineFile: /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-fewshot-learning-app/models/mtmc/resnet50_market1501_aicity156.onnx_b100_gpu0_fp16.engine

Please change to

  onnxFile: /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-fewshot-learning-app/models/mtmc/resnet50_market1501_aicity156.onnx
  modelEngineFile: /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-fewshot-learning-app/models/mtmc/resnet50_market1501_aicity156.onnx_b100_gpu0_fp16.engine

Indeed, I uploaded the default file from the PipeTuner samples zip, which is wrong.
I did change tltEncodedModel to onnxFile when I was using the ONNX model, but the behaviour was the same: same error.

Moreover, I had to convert the ONNX (or TLT) model to an engine in a separate DeepStream instance, because the DS instance launched by PipeTuner fails to parse the ONNX. Using the same DS image (7.0-triton-multiarch) and ds-app, the engine converts successfully (but it then gives the error from the previous logs).
When the PipeTuner instance does the ONNX-to-engine conversion itself, I get these logs:
log_DsAppRun.log (2.7 KB)
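
For reference, another way to pre-build the engine outside the app would be trtexec from inside the same container. This is only a sketch; a dynamic-batch ONNX would additionally need the --minShapes/--optShapes/--maxShapes options to produce the b100 engine:

  # sketch: pre-build the ReID engine with trtexec (file names taken from the tracker config)
  /usr/src/tensorrt/bin/trtexec --onnx=resnet50_market1501_aicity156.onnx \
      --fp16 --saveEngine=resnet50_market1501_aicity156.onnx_b100_gpu0_fp16.engine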

But as mentioned, PipeTuner also gets stuck when I am not using a ReID model. I am not especially interested in getting the ReID model working; I am just trying to get the best tracker possible for my use case.

Best regards,
Alexandre

Hello,

I am still stuck in the same state as in my previous message. Can I provide other logs to help?

Thank you in advance.
Best regards,
Alexandre

Hi, @Alexandre-PAI :

Can you provide the logs for the case with "ReID" disabled?

Of course, here they are:
PipeTuner config: POA_config.txt (9.2 KB)
Initial tracker config:
config_tracker_NvDCF_accuracy_custom.txt (3.3 KB)

PipeTuner logs (stuck after 3 iterations):
log_client_2025-01-07_08-37-29.log (7.9 KB)

log_server_2025-01-07_08-37-29.log (89 Bytes)

I relaunched a DeepStream 7.0 instance with the exact configuration given in the last checkpoint. Here is the configuration from the third iteration:
config_infer_primary.txt (779 Bytes)
config_tracker.txt (2.7 KB)
dsAppConfig_0.txt (2.1 KB)

Here the behaviour is really strange. If I launch DeepStream multiple times with this exact same configuration, sometimes I get the EOS for each stream and a successful DeepStream run:
logDS_rerunOK_NoReid.log (2.7 KB)

Sometimes DeepStream gets stuck and displays 0 FPS for each stream. It seems that it does not receive the EOS for streams 0 and 1.

logDS_rerun_NotOK_NoReid.log (4.7 KB)

I guess this is why PipeTuner gets stuck after some iterations. I don't understand why this behaviour appears randomly.

For good measure, please also find attached the DeepStream log with the configuration from the second PipeTuner iteration (which runs successfully):
log_DsAppRun-Iteration2.log (3.8 KB)

Hello,

Are the logs I posted helpful for investigating this issue?

Thank you in advance,
Alexandre

Can you check your configuration file dsAppConfig_0.txt? It seems "live-source" for nvstreammux is set to 1 for the videos in your dataset. Since the videos are local files, can you set "live-source" to 0 so that the deepstream-app runs correctly?
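
In the generated dsAppConfig_0.txt it is in the [streammux] group, for example:

  [streammux]
  # 0 = local video files (your case), 1 = live sources such as RTSP cameras
  live-source=0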

Indeed. I can change live-source to 0 in dsAppConfig_0.txt when I use it for the re-run tests, and then it is OK. But this file was generated as-is by PipeTuner. How can I prevent PipeTuner from setting live-source to 1 when it starts a run?
I use the sample script launch.txt (4.1 KB) to launch the optimisation process. I don't have access to the deepstream-app configuration parameters, and I don't see anywhere in the PipeTuner files where I can change that.
Thank you again for your help,
Alexandre

The dsAppConfig_xxx.txt files are generated from the templates in the nvcr.io/nvidia/pipetuner:1.0 container. The templates are in the /pipe-tuner/configs/config_dsApp folder; you can change the "live-source" value in the template files.
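
For example, one way is to copy the templates out of the image, edit them, and mount the edited folder back in when you launch (this is just a sketch; adjust the mount to however your launch script starts the container):

  # copy the dsApp config templates out of the PipeTuner image
  docker create --name pt-tmp nvcr.io/nvidia/pipetuner:1.0
  docker cp pt-tmp:/pipe-tuner/configs/config_dsApp ./config_dsApp
  docker rm pt-tmp

  # set live-source=0 in every template
  sed -i 's/^live-source=.*/live-source=0/' ./config_dsApp/*

  # then bind-mount the edited folder when launching, e.g.
  #   -v $(pwd)/config_dsApp:/pipe-tuner/configs/config_dsApp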