Abrupt shutdown with TensorFlow

Under JetPack 4.2, when I initialize TensorFlow and load a model, the Xavier AGX sometimes shuts down abruptly, and I cannot find any diagnostics to investigate the problem.

According to jtop, the temperature is around 50 °C, so a hardware shutdown due to overheating seems unlikely.
System memory and other resource usage also seem normal.

I have tried both the default and MAXN power modes (the target was flashed with the default configuration, not the MAXN variant).

Any help would be appreciated, thank you!

Perhaps it ran out of RAM. From another computer, e.g. over ssh or a serial console, monitor something like htop or:
watch -n 1 free -h
…and see whether free memory approaches zero near the moment of shutdown.
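
Since the board loses power without warning, the last reading on a terminal can be hard to catch. One option is to append each reading to a file and force it to disk. Below is a minimal sketch, assuming a Linux system that exposes /proc/meminfo; the log path mem_log.txt, the function names and the one-second interval are illustrative choices, not anything specific to JetPack.

```python
import os
import time


def parse_meminfo(text):
    """Return the numeric fields of /proc/meminfo as a dict (most are in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def format_sample(fields, now):
    """Build one log line with available RAM and free swap in MiB."""
    avail = fields.get("MemAvailable", 0) // 1024
    swap_free = fields.get("SwapFree", 0) // 1024
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "{} MemAvailable={}MiB SwapFree={}MiB".format(stamp, avail, swap_free)


def log_memory(path="mem_log.txt", interval=1.0):
    """Append one sample per interval until the process (or the board) stops."""
    with open(path, "a") as log:
        while True:
            with open("/proc/meminfo") as f:
                fields = parse_meminfo(f.read())
            log.write(format_sample(fields, time.time()) + "\n")
            # Flush and fsync so the newest line is likely to be on disk
            # even if power is cut right afterwards.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)


# To start logging, call: log_memory()
```

After the next abrupt shutdown, the last lines of mem_log.txt should show how much memory was available shortly before power was lost.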

Perhaps running out of RAM could be mitigated with swap files.
The default swap file seems to be 8 GB, which can be resized or monitored, e.g. with System Monitor or

I think I do not have a swap file enabled at all. How can I check?

@krikun.daniel the command proposed by @linuxdev will reflect the status of swap
e.g.


              total        used        free      shared  buff/cache   available
Mem:            15G        2.2G        9.6G         49M        3.7G         13G
Swap:          7.7G          0B        7.7G
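
The same information can also be read programmatically. Here is a minimal sketch, assuming a Linux kernel that exposes /proc/swaps (one header line, then one line per active swap area with sizes in kB); the function names are made up for illustration.

```python
def parse_swaps(text):
    """Parse the content of /proc/swaps into a list of dicts (sizes in kB)."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        parts = line.split()
        if len(parts) < 5:
            continue
        entries.append({
            "name": parts[0],
            "type": parts[1],
            "size_kb": int(parts[2]),
            "used_kb": int(parts[3]),
        })
    return entries


def describe_swap(entries):
    """Summarize the active swap areas in one human-readable line."""
    if not entries:
        return "no swap enabled"
    total = sum(e["size_kb"] for e in entries)
    used = sum(e["used_kb"] for e in entries)
    return "{} swap area(s), {:.1f} GiB total, {:.1f} GiB used".format(
        len(entries), total / 1048576.0, used / 1048576.0)


def current_swap_status(path="/proc/swaps"):
    """Read the live swap table and return its summary."""
    with open(path) as f:
        return describe_swap(parse_swaps(f.read()))


# Example: print(current_swap_status())
```

If no swap area is active, the file contains only the header line and the summary reads "no swap enabled".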

You should also know that operations which require physical memory will not be helped by swap. CUDA operations using the GPU would run out of RAM and fail in the same way even if you add swap. So it is good to determine first if this is a case of not enough RAM.

If this is not enough RAM, then some aspect of the program might be changed to use less RAM (fewer concurrent kernels for example).

I saw that it has been brought up previously here:
https://forums.developer.nvidia.com/t/xavier-suddenly-cuts-its-own-power/69962

I am using the original power supply in MAXN mode.

What was the last “watch -n 1 free -h” output upon shutdown?

I remember that at a very early stage TensorFlow wouldn’t even install on a Jetson TX if there was no swap file or if it was too small. The board would power off or reset.
GPU memory limitations are another possible issue, and RAM limitations might be a separate one.
@krikun.daniel are you using tensorflow_cpu or tensorflow_gpu installation?