Slow inference on Xavier NX and OOM error with TensorFlow 2

Hi,

I’m in the process of running some of our company’s networks on the Jetson. For the inference I used this GitHub project: Tensorflow-Object-Detection-with-Tensorflow-2.0/detect_from_image.py at master · TannerGilbert/Tensorflow-Object-Detection-with-Tensorflow-2.0 · GitHub
I only made a few small changes to save the results as an image, nothing special. I have an SSD Mobilenet V2 network and an RCNN Resnet101 v1 network. I think I’m doing something wrong, because running the Python script takes far too long in my opinion. The RCNN Resnet101 v1 network cannot be executed correctly at all: it aborts in the middle with an OOM error. In tegrastats I could also watch the RAM usage climb to the maximum. I then read that the Jetson platform has problems allocating memory with TensorFlow, but that post was about the Jetson Nano and is from 2019: https://forums.developer.nvidia.com/t/slow-to-run-tensorflow-resnet-how-do-i-increase-ram-available-to-gpu/76419/5
The recommendation there was to use TensorRT instead. Is that still current, and does it also apply to the Xavier NX Developer Kit? I’ll add the console output here; maybe someone can spot an error.
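
(For what it’s worth, one workaround I have seen mentioned for the allocator behaviour is to stop TensorFlow from reserving almost all GPU memory up front. A minimal sketch of that with the tf.config API; whether it actually avoids the OOM on the NX is an assumption on my part:)

import tensorflow as tf

# Sketch: ask TensorFlow to grow GPU memory on demand instead of
# reserving (almost) everything at start-up. Untested on the Xavier NX.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)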

The first lines in the output for ssd-mobilenet, just for loading the libraries, already take several tens of seconds. If I run the inference twice in the Python script, the second run is of course faster, but it all still seems far too slow to me.
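
(For reference, this is roughly how I compare the two runs; a simplified sketch in which detect_fn and input_tensor stand in for the objects built by detect_from_image.py:)

import time

# detect_fn / input_tensor are placeholders for the detection function and
# input built in detect_from_image.py from the linked repository.
_ = detect_fn(input_tensor)               # first run: includes tracing, cuDNN autotuning, ...
start = time.time()
detections = detect_fn(input_tensor)      # second run: measured
print(f"inference time: {time.time() - start:.3f} s")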

I still have the TensorFlow Profiler calls in the code, so some of the messages in the log come from that.
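
(I mean the standard profiler start/stop calls, roughly like this; the log directory is just an example path:)

import tensorflow as tf

# Profiler session wrapped around the inference; 'logs/profile' is an example path.
tf.profiler.experimental.start('logs/profile')
# ... run the detection here ...
tf.profiler.experimental.stop()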

Thanks for your help.

software versions:
TensorFlow 2.5.0
CUDA 10.2.89
JetPack 4.5.1-b17

I had to put the output into a file because it exceeded the character limit.

rcnn-resnet.txt (137.4 KB)

This is the output for ssd-mobilenet:

2021-08-30 11:58:57.116817: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-30 11:59:06.516826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-30 11:59:06.557076: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:06.557372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7,58GiB deviceMemoryBandwidth: 66,10GiB/s
2021-08-30 11:59:06.557472: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-30 11:59:06.606590: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-30 11:59:06.607687: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2021-08-30 11:59:06.625367: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-30 11:59:06.667181: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-30 11:59:06.685707: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-08-30 11:59:06.703972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2021-08-30 11:59:06.705111: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-30 11:59:06.705643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:06.706125: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:06.706575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-30 11:59:06.727893: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-08-30 11:59:06.727987: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-08-30 11:59:06.728135: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-08-30 11:59:06.744209: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcupti.so.10.2
2021-08-30 11:59:36.632407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:36.633732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7,58GiB deviceMemoryBandwidth: 66,10GiB/s
2021-08-30 11:59:36.634002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:36.634331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:36.634423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-30 11:59:36.634580: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-30 11:59:40.627209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-30 11:59:40.627349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2021-08-30 11:59:40.627407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2021-08-30 11:59:40.627819: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:40.628155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:40.628443: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:40.628625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1628 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2021-08-30 12:01:57.497325: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-30 12:01:58.396764: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 31000000 Hz
2021-08-30 12:02:05.768983: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-30 12:02:07.882579: I tensorflow/stream_executor/cuda/cuda_dnn.cc:380] Loaded cuDNN version 8000
2021-08-30 12:02:14.281739: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-30 12:02:25.864267: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:25.897688: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:26.069769: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,79GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:26.069978: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,79GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.920629: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,76GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.920851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,76GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.967980: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,77GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.968215: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,77GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:28.434495: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2,51GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:28.434783: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2,51GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:28.565390: W tensorflow/core/common_runtime/bfc_allocator.cc:337] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
{'detection_anchor_indices': <tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 9419., 10193., 14057., 10079.,  9359.,  9275., 10139., 15377.,
        14663., 10799., 13913., 10853.,  9389., 10307., 13307., 12299.,
        26728., 26410., 15377., 13451., 25834., 14771., 10109., 14171.,
         9329.,  9305., 25948., 26093.,  9335.,  9335., 10025., 10253.,
        26069., 14509., 13973.,  8699., 15461.,  9305., 10025.,  9305.,
        11723.,  9995., 27946., 11405., 26452., 11555., 25648., 25648.,
        13997., 10169., 26422., 26710., 30466., 26194.,  9305., 11555.,
        14747., 26296., 15151., 10079., 26626., 26266., 30520.,  9995.,
        26482., 25864.,  4738., 13391., 12527., 25678., 24019.,  5968.,
        14573., 26656., 11723., 11429., 12557., 10979., 26338., 10025.,
        26296.,  9935., 14915., 25966., 30570., 30498., 26062.,  8729.,
        15569., 10193.,  9305., 11495., 26998., 30520., 15289., 13043.,
        29035., 11951.,  9329., 11981.]], dtype=float32)>, 'detection_boxes': <tf.Tensor: shape=(1, 100, 4), dtype=float32, numpy=
array([[[3.29538405e-01, 3.26002032e-01, 5.21920264e-01, 3.70194346e-01],
        [3.37561607e-01, 6.68904066e-01, 5.51527321e-01, 7.14050770e-01],
        [4.81567234e-01, 3.76965284e-01, 7.15000927e-01, 4.22366321e-01],
        [3.34119827e-01, 4.75237340e-01, 5.49582362e-01, 5.19531071e-01],

.... (and then the rest of the results follow.)

Hi,

Please note that Xavier NX has only 8 GB of memory, which is shared between the CPU and the GPU.
RCNN Resnet101 seems to be too heavy for the NX.

Have you tried running the model on a desktop GPU?
If yes, would you mind measuring the model’s memory requirement first?
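
(For example, with TensorFlow 2.5 you could check the peak allocation on the desktop roughly like this; 'GPU:0' assumes a single visible GPU:)

import tensorflow as tf

# Query the allocator statistics after one inference pass; values are in bytes.
info = tf.config.experimental.get_memory_info('GPU:0')
print('current:', info['current'] / 1024**2, 'MiB')
print('peak:   ', info['peak'] / 1024**2, 'MiB')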

For SSD MobileNetV2, the slowness mainly comes from TensorFlow initialization.
For edge devices, we recommend using a lightweight framework like TensorRT.
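
(One possible starting point is TF-TRT, which converts a TensorFlow SavedModel so that supported subgraphs run through TensorRT. A minimal sketch with example paths, not a verified command for your model:)

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Example paths; replace with your exported SavedModel directories.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='ssd_mobilenet_v2/saved_model')
converter.convert()
converter.save('ssd_mobilenet_v2_trt')

Loading the converted model with tf.saved_model.load() then works the same way as before; precision mode and workspace size can be tuned through the conversion parameters if needed.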

Lastly, in case you don’t know it already, you can boost the device performance with the commands below:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.
