My inference stops without any error

I am using jetson.inference and jetson.utils. I trained my own model and run inference the same way as imagenet.py.

The script runs well in the beginning and outputs classifications correctly.

But after a few hours (maybe around two), the script still looks like it is running, yet no messages or errors are output anymore.

Could you please help to fix the issue?

Or is there any way to get a clue about what is happening?

I am running the scripts like this: nohup python3 detect.py > detect.log 2>&1 &

The script can run for around 6 hours when using 1 model and no Python class.

It can run for around 2 hours when using 2 models wrapped in a Python class.
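
One way I can try to get a clue when it hangs is Python's built-in faulthandler, so the log still gets a traceback on demand. A minimal sketch (the signal choice and the one-hour interval are just examples):

import faulthandler
import signal

# Dump every thread's traceback to stderr (which nohup redirects
# into detect.log) whenever the process receives SIGUSR1:
#   kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optionally also dump tracebacks automatically every hour,
# which shows where the script is stuck once it goes silent.
faulthandler.dump_traceback_later(3600, repeat=True)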

Thank you.

Hi,

Do you get any output log on the console?
If yes, would you mind sharing it with us first?

Thanks.

Just now I got the message below, and then the script stopped:

class 0000 - 0.994136 (0Ignor
)
class 0000 - *** stack smashing detected ***: terminated

Hi,

It seems that you are using an RTSP source.
Would you mind testing this issue with a CSI or USB camera as well?
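
For example, the stock sample can be pointed at a local camera directly (assuming a default jetson-inference install; adjust the camera index/device to your setup):

$ ./imagenet.py csi://0        # MIPI CSI camera
$ ./imagenet.py /dev/video0    # V4L2 USB camera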

Thanks.

It is getting worse.
Now, when running 2 models, it can only run for 5 minutes before all output and messages stop.
1 model can run for around 20 minutes (before, it was 6 hours, then 3 hours; I killed and restarted it every 3 hours to keep it working, and for now it works, but I am not sure how much longer it can run).

I am running the 1-model and 2-model setups at the same time, connected to the same RTSP cameras:
The 1-model script runs on a Nano, connected to 2 cameras (A, B).
The 2-model script runs on an NX, connected to the same 2 cameras (A, B).
The same 2-model script runs on a TX2, connected to 3 other cameras (C, D, E).

Now even 5 minutes is hard to reach…

Any suggestions?
Thank you.

Hi,

We want to reproduce this issue internally and investigate it further.
Would you mind sharing the source and detailed steps to reproduce it?

Thanks.

I think the reason could be:

The inference runs all the time, so maybe there are resource leaks?
That would explain why the script cannot run for long.

I kill the script from a shell script; could that increase the resource leaks?
Is that why the script runs for shorter and shorter periods?

After I kill the script, is there any way to clean up these leaked resources?
Are there any tools to monitor resource usage?

Thank you.

Hi,

If you terminate the script, the occupied memory should be released immediately.
You can monitor the system status with tegrastats directly.

$ sudo tegrastats
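
If you also want per-process numbers from inside the script, a minimal Linux-only sketch (the helper name is our own, not a jetson.utils API) can log the resident set size next to each detection:

import time

def rss_kb():
    # Resident set size of this process in kB, read from /proc (Linux only)
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1

# For example, inside the detection loop:
# print(time.strftime("%H:%M:%S"), "VmRSS:", rss_kb(), "kB", flush=True)

If the number grows without bound, something is leaking host memory; if it stays flat while the output stops, the process is more likely deadlocked than leaking.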

Thanks.

Thank you. I monitored the resources, and they look good.
At the beginning, while detection works well, the GPU is in use.

After a few minutes, detection stops working and the GPU usage drops.

Based on the monitoring, there do not appear to be any resource leaks.
I do not know what is causing the inference to stop.
Any suggestions? Thank you.

Could you please let me know your email address so I can send the code to you?

Hi,

Would you mind sharing it through a private message?
Also, does this issue occur with the jetson-inference default sample and default model?
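
As a stopgap while you debug, a watchdog thread can force the process to exit when frames stop arriving, so an outer supervisor (a shell loop or systemd unit) restarts it instead of leaving it hung. A rough sketch, assuming you call heartbeat() after every successfully processed frame (both function names are hypothetical):

import os
import threading
import time

last_ok = time.monotonic()

def heartbeat():
    # Call after every successfully processed frame
    global last_ok
    last_ok = time.monotonic()

def watchdog(timeout=60):
    # Exit hard if no frame was processed for `timeout` seconds,
    # so the supervisor can restart the script cleanly
    while True:
        time.sleep(5)
        if time.monotonic() - last_ok > timeout:
            print("watchdog: no frame for %d s, exiting" % timeout, flush=True)
            os._exit(1)

threading.Thread(target=watchdog, daemon=True).start()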

Thanks.