System freezes when our application accesses CUDA

We have a custom BSP (based on L4T 32.4.3) for the AGX Xavier on a custom carrier board.

One of our customers reported the issue below. Could you please advise?
I have been troubleshooting a problem on the custom carrier board with the AGX Xavier that prevents TensorFlow from functioning. The problem appears on reboot after installing TensorFlow’s dependencies through the nvidia-jetpack metapackage.

Our application works before the reboot and can use the new packages. After the reboot, the system freezes when our application accesses CUDA through TensorFlow (see the attached dmesg output).

It seems that the GPU driver crashes. I have included the following: the apt history log, the output of the CUDA utilities deviceQuery and bandwidthTest before and after restarting, and the dmesg output after the error.
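For anyone comparing the before and after captures, a minimal Python sketch of a line-by-line comparison follows. It assumes each tool's output was saved to a text file; `diff_outputs` and `diff_files` are helper names introduced here for illustration, and the paths passed to them are whatever names the captures were saved under.

```python
import difflib
from pathlib import Path
from typing import List


def diff_outputs(before_text: str, after_text: str,
                 before_name: str = "before_reboot",
                 after_name: str = "after_reboot") -> List[str]:
    """Return unified-diff lines between two captured tool outputs."""
    return list(difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile=before_name,
        tofile=after_name,
        lineterm="",  # inputs are already split, so no trailing newlines
    ))


def diff_files(before_path: str, after_path: str) -> List[str]:
    """Read two saved captures from disk and diff them (paths are caller-supplied)."""
    before = Path(before_path).read_text(errors="replace")
    after = Path(after_path).read_text(errors="replace")
    return diff_outputs(before, after, before_path, after_path)


if __name__ == "__main__":
    import sys
    # Usage: python3 diff_captures.py <before.txt> <after.txt>
    if len(sys.argv) == 3:
        print("\n".join(diff_files(sys.argv[1], sys.argv[2])))
```

An empty result means the two captures are identical. Any `-` or `+` lines show what changed across the reboot, for example a device that is no longer listed.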

I have referred to the following NVIDIA forum thread, which describes a similar problem: Cuda hangs after installation of jetpack and reboot - #4 by AastaLLL

apt_history.txt (8.7 KB)
bandwidthTest_after_reboot.txt (114 Bytes)
bandwidthTest_before_reboot.txt (585 Bytes)
deviceQuery_after_reboot.txt (2.3 KB)
dmesg_bandwidthTest_after_reboot.txt (16.1 KB)

Hi,

Since newer BSP releases are available, is it possible to upgrade and try again?
It would be good if you could test this on JetPack 4.6 (rel-32.6.1).
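To confirm which L4T release a board is actually running before and after such an upgrade, a small check like the one below may help. This is only a sketch: it assumes the release is recorded on the first line of `/etc/nv_tegra_release` in the form `# R32 (release), REVISION: 4.3, ...`, as is typical on L4T images. A custom BSP may differ, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed format of the first line of /etc/nv_tegra_release, e.g.:
#   # R32 (release), REVISION: 4.3, GCID: ..., BOARD: ..., EABI: aarch64, DATE: ...
_RELEASE_RE = re.compile(
    r"#\s*R(\d+)\s*\(release\),\s*REVISION:\s*(\d+(?:\.\d+)*)"
)


def parse_l4t_release(first_line: str) -> Optional[str]:
    """Turn the release line into a dotted version, or None if it does not match."""
    match = _RELEASE_RE.search(first_line)
    if match is None:
        return None
    major, revision = match.groups()
    return f"{major}.{revision}"


def read_l4t_release(path: str = "/etc/nv_tegra_release") -> Optional[str]:
    """Read the release file; return None if it is missing or unreadable."""
    try:
        with open(path, "r", errors="replace") as handle:
            return parse_l4t_release(handle.readline())
    except OSError:
        return None


if __name__ == "__main__":
    print(read_l4t_release() or "L4T release not found")
```

Under the assumed format, a line starting with `# R32 (release), REVISION: 6.1` would be reported as `32.6.1`, matching the release suggested above.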

Thanks.
