Slow inference on Xavier NX and OOM error with TensorFlow 2

Hi,

I’m in the process of running some of our company’s networks on the Jetson. For the inference I used this GitHub project: Tensorflow-Object-Detection-with-Tensorflow-2.0/detect_from_image.py at master · TannerGilbert/Tensorflow-Object-Detection-with-Tensorflow-2.0 · GitHub
I only made a few small changes to save the results as an image, nothing special. I have an SSD Mobilenet V2 network and an RCNN Resnet101 v1 network. I think I’m doing something wrong, because running the Python script takes far too long in my opinion. The RCNN Resnet101 v1 network cannot be executed correctly at all: it aborts in the middle with an OOM error. In tegrastats I could also watch the RAM usage climb to the maximum. I then read that the Jetson platform has problems allocating memory with TensorFlow, but that post was about the Jetson Nano and is from 2019: https://forums.developer.nvidia.com/t/slow-to-run-tensorflow-resnet-how-do-i-increase-ram-available-to-gpu/76419/5
The recommendation there was to use TensorRT instead. Is that still current, and does it also apply to the Xavier NX Developer Kit? I’ll add the console output here; maybe someone can spot an error.
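
(For what it’s worth, one workaround I have seen mentioned for the allocator behaviour is to stop TensorFlow from reserving almost all GPU memory up front. A minimal sketch of that with the tf.config API; whether it actually avoids the OOM on the NX is an assumption on my part:)

import tensorflow as tf

# Sketch: ask TensorFlow to grow GPU memory on demand instead of
# reserving (almost) everything at start-up. Untested on the Xavier NX.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)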

The first lines in the output for ssd-mobilenet, just for loading the libraries, already take several tens of seconds. If I run the inference twice in the Python script, the second run is of course faster, but it all still seems far too slow to me.
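
(For reference, this is roughly how I compare the two runs; a simplified sketch in which detect_fn and input_tensor stand in for the objects built by detect_from_image.py:)

import time

# detect_fn / input_tensor are placeholders for the detection function and
# input built in detect_from_image.py from the linked repository.
_ = detect_fn(input_tensor)               # first run: includes tracing, cuDNN autotuning, ...
start = time.time()
detections = detect_fn(input_tensor)      # second run: measured
print(f"inference time: {time.time() - start:.3f} s")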

I still have the TensorFlow Profiler calls in the code, so some of the messages in the log come from that.
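
(I mean the standard profiler start/stop calls, roughly like this; the log directory is just an example path:)

import tensorflow as tf

# Profiler session wrapped around the inference; 'logs/profile' is an example path.
tf.profiler.experimental.start('logs/profile')
# ... run the detection here ...
tf.profiler.experimental.stop()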

Thanks for your help.

software versions:
TensorFlow 2.5.0
CUDA 10.2.89
JetPack 4.5.1-b17

I had to put the output into a file because it exceeded the character limit.

rcnn-resnet.txt (137.4 KB)

This is the output for ssd-mobilenet:

2021-08-30 11:58:57.116817: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-30 11:59:06.516826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-30 11:59:06.557076: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:06.557372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7,58GiB deviceMemoryBandwidth: 66,10GiB/s
2021-08-30 11:59:06.557472: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-30 11:59:06.606590: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-30 11:59:06.607687: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2021-08-30 11:59:06.625367: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-30 11:59:06.667181: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-30 11:59:06.685707: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-08-30 11:59:06.703972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2021-08-30 11:59:06.705111: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-30 11:59:06.705643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:06.706125: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:06.706575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-30 11:59:06.727893: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-08-30 11:59:06.727987: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-08-30 11:59:06.728135: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-08-30 11:59:06.744209: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcupti.so.10.2
2021-08-30 11:59:36.632407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:36.633732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7,58GiB deviceMemoryBandwidth: 66,10GiB/s
2021-08-30 11:59:36.634002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:36.634331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:36.634423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-30 11:59:36.634580: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-30 11:59:40.627209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-30 11:59:40.627349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2021-08-30 11:59:40.627407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2021-08-30 11:59:40.627819: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:40.628155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:40.628443: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-30 11:59:40.628625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1628 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2021-08-30 12:01:57.497325: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-30 12:01:58.396764: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 31000000 Hz
2021-08-30 12:02:05.768983: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-30 12:02:07.882579: I tensorflow/stream_executor/cuda/cuda_dnn.cc:380] Loaded cuDNN version 8000
2021-08-30 12:02:14.281739: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-30 12:02:25.864267: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:25.897688: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:26.069769: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,79GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:26.069978: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,79GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.920629: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,76GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.920851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,76GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.967980: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,77GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:27.968215: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1,77GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:28.434495: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2,51GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:28.434783: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2,51GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-08-30 12:02:28.565390: W tensorflow/core/common_runtime/bfc_allocator.cc:337] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
{'detection_anchor_indices': <tf.Tensor: shape=(1, 100), dtype=float32, numpy=
array([[ 9419., 10193., 14057., 10079.,  9359.,  9275., 10139., 15377.,
        14663., 10799., 13913., 10853.,  9389., 10307., 13307., 12299.,
        26728., 26410., 15377., 13451., 25834., 14771., 10109., 14171.,
         9329.,  9305., 25948., 26093.,  9335.,  9335., 10025., 10253.,
        26069., 14509., 13973.,  8699., 15461.,  9305., 10025.,  9305.,
        11723.,  9995., 27946., 11405., 26452., 11555., 25648., 25648.,
        13997., 10169., 26422., 26710., 30466., 26194.,  9305., 11555.,
        14747., 26296., 15151., 10079., 26626., 26266., 30520.,  9995.,
        26482., 25864.,  4738., 13391., 12527., 25678., 24019.,  5968.,
        14573., 26656., 11723., 11429., 12557., 10979., 26338., 10025.,
        26296.,  9935., 14915., 25966., 30570., 30498., 26062.,  8729.,
        15569., 10193.,  9305., 11495., 26998., 30520., 15289., 13043.,
        29035., 11951.,  9329., 11981.]], dtype=float32)>, 'detection_boxes': <tf.Tensor: shape=(1, 100, 4), dtype=float32, numpy=
array([[[3.29538405e-01, 3.26002032e-01, 5.21920264e-01, 3.70194346e-01],
        [3.37561607e-01, 6.68904066e-01, 5.51527321e-01, 7.14050770e-01],
        [4.81567234e-01, 3.76965284e-01, 7.15000927e-01, 4.22366321e-01],
        [3.34119827e-01, 4.75237340e-01, 5.49582362e-01, 5.19531071e-01],

.... (and then the rest of the results follow.)

Hi,

Please note that Xavier NX has only 8 GB of memory, which is shared between the CPU and the GPU.
RCNN Resnet101 seems to be too heavy for the NX.

Have you tried running the model on a desktop GPU?
If yes, would you mind measuring the model’s memory requirement first?
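
(For example, with TensorFlow 2.5 you could check the peak allocation on the desktop roughly like this; 'GPU:0' assumes a single visible GPU:)

import tensorflow as tf

# Query the allocator statistics after one inference pass; values are in bytes.
info = tf.config.experimental.get_memory_info('GPU:0')
print('current:', info['current'] / 1024**2, 'MiB')
print('peak:   ', info['peak'] / 1024**2, 'MiB')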

For SSD MobileNetV2, the slowness mainly comes from TensorFlow initialization.
For edge devices, we recommend using a lightweight framework like TensorRT.
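
(One possible starting point is TF-TRT, which converts a TensorFlow SavedModel so that supported subgraphs run through TensorRT. A minimal sketch with example paths, not a verified command for your model:)

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Example paths; replace with your exported SavedModel directories.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='ssd_mobilenet_v2/saved_model')
converter.convert()
converter.save('ssd_mobilenet_v2_trt')

Loading the converted model with tf.saved_model.load() then works the same way as before; precision mode and workspace size can be tuned through the conversion parameters if needed.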

Lastly, in case you don’t know it already, you can boost the device performance with the commands below:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.
