Tensorflow_xla model inference crash on Jetson AGX xavier

chuanxiu.liu · March 26, 2021, 12:52pm

I am seeing a crash with TensorFlow XLA during model inference on an NVIDIA Jetson AGX Xavier (aarch64) system.

System information

Have I written custom code: Yes. I have several custom ops that compute on the CPU; they sit at different places in the middle of the network.
OS platform and distribution: Linux Ubuntu 18.04
Device: NVIDIA Jetson AGX Xavier, JetPack 4.4
TensorFlow installed from (source or binary): source, TensorFlow 2.4.1 compiled on the Xavier
TensorFlow version: 2.4.1
Python version: 3.6.8
Bazel version (if compiling from source): 3.1.0
GCC/Compiler version (if compiling from source): 7.5.0
CUDA/cuDNN version: CUDA 10.2, cuDNN 8.0

Issues:
1, With XLA enabled for model inference, the crash reproduces almost every time (Xavier, aarch64 system).
2, With some of the custom ops (CPU compute ops) turned off, the crash still happens in about 7 of 10 runs (Xavier, aarch64 system).
3, The same code with the same TensorFlow 2.4.1 version runs successfully, without any crash, on a V100 GPU and x86 system.
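Crash rates like "7 of 10 runs" can be collected with a small harness that launches the inference script repeatedly and records every abnormal exit. The following is only a sketch, not the harness used in this thread: the function name `count_crashes` and the script name `infer.py` in the usage note are hypothetical.

```python
import subprocess
import sys


def count_crashes(cmd, runs=20, timeout=None):
    """Run `cmd` `runs` times and return (run_index, returncode) for each failure.

    On POSIX a negative return code means the process was killed by a
    signal, e.g. -11 for SIGSEGV, which is typical for a native crash.
    """
    failures = []
    for i in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        if proc.returncode != 0:
            failures.append((i, proc.returncode))
    return failures
```

For example, `count_crashes([sys.executable, "infer.py"], runs=20)` would return one entry per crashed run, so `len(...)` gives the crash count out of 20.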

AastaLLL · March 29, 2021, 2:12am

Hi,

Is it possible to disable all the custom ops?
If yes, could you give it a try?

Thanks.

chuanxiu.liu · March 30, 2021, 6:57am

After disabling all the custom ops, the crash still occurs: 6 crashes in 20 runs.

After disabling XLA, the TensorFlow model runs successfully without a crash, but the latency is bad.

AastaLLL · March 31, 2021, 7:56am

Hi,

We would like to reproduce this in our environment to take a look.
Could you share the detailed steps and resources needed to reproduce it?

Thanks.

chuanxiu.liu · March 31, 2021, 8:33am

On JetPack 4.4 (CUDA 10.2, cuDNN 8.0, TensorRT 7.1.3.0, TensorFlow 2.4.1):
1, I cannot share the code because it is a trade secret, sorry about that!
2, The custom op is a common pixel-level computation. It contains some if/else logic, so we did not write it as a CUDA kernel.
3, With XLA compilation disabled for all the custom ops, the crash still occurs: 6 crashes in 20 runs.
4, With XLA compilation enabled for the whole network: 20 crashes in 20 runs.
5, With XLA disabled for the whole network, the pure TensorFlow model runs successfully without a crash.

I have also tested JetPack 4.3 (TensorFlow 2.2) on the Jetson AGX Xavier and found no crash there (https://developer.nvidia.com/jetpack-43-archive).
On JetPack 4.3 (CUDA 10.0, cuDNN 7.6.3, TensorRT 6.0.1.10, TensorFlow 2.2):
1, With XLA enabled for the whole network, the TensorFlow XLA model runs successfully without a crash, and the latency is good.

Comparing JetPack 4.4 with TensorFlow 2.4 against JetPack 4.3 with TensorFlow 2.2, I also found a new latency regression in the tf.math.unsorted_segment_max op:
JetPack 4.3 with TensorFlow 2.2: latency of the “tf.math.unsorted_segment_max” op on Jetson AGX Xavier: 2~3 ms
JetPack 4.4 with TensorFlow 2.4: latency of the “tf.math.unsorted_segment_max” op on Jetson AGX Xavier: 120 ms
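A per-op latency comparison like this can be isolated from the full network with a standalone micro-benchmark. The sketch below is one possible way to time the op; it is not the measurement used in this thread. The helper names and the input shapes (`num_pixels`, `num_segments`) are assumptions, since the real tensor sizes are not given here.

```python
import statistics
import time


def median_latency_ms(fn, warmup=5, iters=50):
    """Call `fn` repeatedly and return the median wall-clock latency in ms."""
    for _ in range(warmup):  # warm-up calls absorb tracing and first-run setup
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def bench_unsorted_segment_max(num_pixels=480 * 640, num_segments=1024):
    """Time tf.math.unsorted_segment_max on random data (hypothetical shapes)."""
    import tensorflow as tf  # imported lazily so the timer works without TF

    data = tf.random.uniform([num_pixels], dtype=tf.float32)
    segment_ids = tf.random.uniform(
        [num_pixels], maxval=num_segments, dtype=tf.int32
    )

    @tf.function
    def step():
        return tf.math.unsorted_segment_max(data, segment_ids, num_segments)

    # .numpy() copies the result to the host, so the timing includes
    # the completed device execution and not just the op launch.
    return median_latency_ms(lambda: step().numpy())
```

Running `bench_unsorted_segment_max()` on both JetPack installations with the same shapes would show whether the slowdown reproduces outside the full model.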

AastaLLL · April 13, 2021, 12:34pm

Hi,

You don’t need to share the whole source.
Extracting a tiny reproducible sample would help a lot.

We are not sure we understand your question correctly.
It seems that TensorFlow XLA is compiling the code for the CPU.
But the main difference between JetPack 4.3 and 4.4 is in the GPU-related libraries, not in anything CPU-related.

Thanks.
