Mask-RCNN on Jetson Xavier NX issue 'std::logic_error'

Hi,

I am trying to run inference with the Mask R-CNN network (https://github.com/matterport/Mask_RCNN) on my Jetson Xavier NX, and I get the following error:

terminate called after throwing an instance of ‘std::logic_error’
what(): basic_string::_M_construct null not valid

It looks to me like something went wrong in CUDA/cuDNN. The same inference script and model run on a Jetson TX2 with no issues. Can someone help?

TensorFlow version: 1.15.2
Keras version: 2.3.1
JetPack: 4.4 DP
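
For anyone comparing setups across boards (for example the TX2 and the Xavier NX mentioned above), the version list can be collected programmatically. This is a minimal sketch, not part of the original script: the function name `collect_versions` and the default module list are assumptions chosen for illustration, and it only reads each package's `__version__` attribute.

```python
import importlib
import platform


def collect_versions(module_names=("tensorflow", "keras", "numpy")):
    """Return {module name: version string} for pasting into a bug report.

    Modules that cannot be imported are reported as "not installed";
    modules without a __version__ attribute are reported as "unknown".
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    # Print the interpreter version and CPU architecture first.
    print("python", platform.python_version(), platform.machine())
    for name, version in collect_versions().items():
        print(name, version)
```

Running the same snippet on both boards makes it easier to spot a difference in the Python-level packages before looking at CUDA or cuDNN.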

The full error log is below:

2020-06-09 01:10:46.938496: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
Using TensorFlow backend.

Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.9
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 800
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {‘rpn_class_loss’: 1.0, ‘rpn_bbox_loss’: 1.0, ‘mrcnn_class_loss’: 1.0, ‘mrcnn_bbox_loss’: 1.0, ‘mrcnn_mask_loss’: 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME balloon
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 100
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.0001

WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/py3venv/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/py3venv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4070: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/Mask_RCNN/mrcnn/model.py:341: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/Mask_RCNN/mrcnn/model.py:399: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/Mask_RCNN/mrcnn/model.py:423: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version.
Instructions for updating:
box_ind is deprecated, use box_indices instead
WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/Mask_RCNN/mrcnn/model.py:720: The name tf.sets.set_intersection is deprecated. Please use tf.sets.intersection instead.

WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/Mask_RCNN/mrcnn/model.py:722: The name tf.sparse_tensor_to_dense is deprecated. Please use tf.sparse.to_dense instead.

WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/Mask_RCNN/mrcnn/model.py:772: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2020-06-09 01:11:27.179239: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-06-09 01:11:27.238339: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.238625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.109
pciBusID: 0000:00:00.0
2020-06-09 01:11:27.238726: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-09 01:11:27.289147: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-09 01:11:27.348348: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-06-09 01:11:27.410311: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-06-09 01:11:27.516189: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-06-09 01:11:27.567170: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-06-09 01:11:27.568693: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-09 01:11:27.569129: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.569561: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.569739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-06-09 01:11:27.599165: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-06-09 01:11:27.600509: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3e516100 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-09 01:11:27.600766: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-09 01:11:27.779628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.780197: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3e513bc0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-09 01:11:27.780333: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Xavier, Compute Capability 7.2
2020-06-09 01:11:27.781062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.781270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.109
pciBusID: 0000:00:00.0
2020-06-09 01:11:27.781453: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-09 01:11:27.781577: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-09 01:11:27.781664: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-06-09 01:11:27.781745: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-06-09 01:11:27.781934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-06-09 01:11:27.782058: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-06-09 01:11:27.782148: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-09 01:11:27.782451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.782900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:27.783133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-06-09 01:11:27.783392: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-09 01:11:32.460881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-09 01:11:32.460999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2020-06-09 01:11:32.461047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2020-06-09 01:11:32.461752: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:32.462181: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-06-09 01:11:32.462439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2427 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Re-starting from epoch 30
Start detecting…
Processing 1 images
image shape: (2288, 2719, 3) min: 0.00000 max: 255.00000 uint8
molded_images shape: (1, 1024, 1024, 3) min: -123.70000 max: 151.10000 float64
image_metas shape: (1, 14) min: 0.00000 max: 2719.00000 float64
anchors shape: (1, 261888, 4) min: -0.35390 max: 1.29134 float32
WARNING:tensorflow:From /home/braintree-jetsonnx-1/python3_ws/py3venv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2020-06-09 01:11:49.695996: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-09 01:11:50.607091: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-09 01:12:00.741027: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-06-09 01:12:00.749922: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-06-09 01:12:04.149524: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-06-09 01:12:04.149805: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
terminate called after throwing an instance of ‘std::logic_error’
what(): basic_string::_M_construct null not valid
Aborted (core dumped)


Hi,

This is a known issue in cuDNN v8.
Our internal team is still working on it.

Some workarounds are discussed in this topic:

Thanks.


Hello, has this problem been solved?


I have the same error on my Xavier NX and am waiting for a fix.
Mask_RCNN works correctly on my Xavier AGX with JetPack 4.3.
The workaround is not easy because we need to rebuild TensorFlow against older cuDNN libraries…
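
Since the issue is tied to the cuDNN major version, it can help to confirm which cuDNN the system actually loads. The sketch below calls `cudnnGetVersion()` from the shared library through `ctypes`. The helper names `cudnn_version` and `split_version` and the list of candidate library file names are assumptions for illustration. The integer decoding assumes the v7/v8 convention (major * 1000 + minor * 100 + patch) and would not be valid for releases that use a different encoding.

```python
import ctypes
import ctypes.util


def cudnn_version(library_name=None):
    """Return the integer reported by cudnnGetVersion(), or None.

    None means no cuDNN shared library could be loaded on this machine.
    """
    if library_name:
        candidates = [library_name]
    else:
        # Candidate names are assumptions; adjust them for your install.
        candidates = [
            ctypes.util.find_library("cudnn"),
            "libcudnn.so.8",
            "libcudnn.so.7",
            "libcudnn.so",
        ]
    for candidate in candidates:
        if not candidate:
            continue
        try:
            lib = ctypes.CDLL(candidate)
        except OSError:
            continue  # not present under this name, try the next one
        lib.cudnnGetVersion.restype = ctypes.c_size_t
        return int(lib.cudnnGetVersion())
    return None


def split_version(version):
    """Decode a v7/v8-style cuDNN version integer into (major, minor, patch)."""
    return version // 1000, (version % 1000) // 100, version % 100


if __name__ == "__main__":
    version = cudnn_version()
    if version is None:
        print("cuDNN library not found")
    else:
        print("cuDNN %d.%d.%d" % split_version(version))
```

This only reports what the dynamic loader finds. A TensorFlow wheel built against a particular cuDNN major version will still try to open that specific library, as the `libcudnn.so.8` lines in the log above show.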

I confirmed that Mask_RCNN works with the newest JetPack 4.4 (not DP). Thanks!
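
To tell the Developer Preview apart from the production release on a given board, one option is to read the L4T release string. This is a rough sketch under several assumptions: that the board has an `/etc/nv_tegra_release` file whose first line contains `R<major> (release), REVISION: <minor>`, and that JetPack 4.4 DP corresponds to L4T 32.4.2 while the 4.4 production release corresponds to L4T 32.4.3. Please verify both against the JetPack release notes. The function names are hypothetical.

```python
import re

# Assumed location of the L4T release file on Jetson images.
RELEASE_FILE = "/etc/nv_tegra_release"


def parse_l4t_release(first_line):
    """Extract an L4T version such as '32.4.2' from the release line.

    Returns None if the line does not match the assumed format.
    """
    match = re.search(r"R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)", first_line)
    if match is None:
        return None
    return "{}.{}".format(match.group(1), match.group(2))


def read_l4t_release(path=RELEASE_FILE):
    """Read and parse the release file; return None if it is unreadable."""
    try:
        with open(path) as handle:
            return parse_l4t_release(handle.readline())
    except OSError:
        return None


if __name__ == "__main__":
    print("L4T release:", read_l4t_release())
```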