Segmentation fault at training network

ancient-ghost · June 2, 2021, 12:12pm

I am getting a segmentation fault while training my neural network.

$ python tools/train_lanenet.py

The output is as follows:

2021-06-02 13:47:51.200238: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
W0602 13:47:51.293999 15212 deprecation.py:40] Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
W0602 13:47:56.888735 15212 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

community/20180907-contrib-sunset.md at master · tensorflow/community · GitHub
GitHub - tensorflow/addons: Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
GitHub - tensorflow/io: Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0602 13:47:58.538597 15212 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.read_file is deprecated. Please use tf.io.read_file instead.

W0602 13:47:58.540126 15212 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-06-02 13:48:25.111512: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-06-02 13:48:25.111947: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x13a23150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-02 13:48:25.111996: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-02 13:48:25.119393: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-02 13:48:25.217014: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.217320: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15064ac0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-06-02 13:48:25.217370: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X2, Compute Capability 6.2
2021-06-02 13:48:25.217657: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.217760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3
pciBusID: 0000:00:00.0
2021-06-02 13:48:25.217827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-06-02 13:48:25.221976: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-06-02 13:48:25.224597: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-02 13:48:25.225339: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-02 13:48:25.229724: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-06-02 13:48:25.232906: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-06-02 13:48:25.233694: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-02 13:48:25.233889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.234066: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.234137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1794] Adding visible gpu devices: 0
2021-06-02 13:48:25.234214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-06-02 13:48:26.602513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 13:48:26.602598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2021-06-02 13:48:26.602625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
2021-06-02 13:48:26.602919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:26.603123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:26.603250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6672 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
I0602 13:48:29.795616 15212 train_lanenet.py:232] Training from scratch
2021-06-02 13:49:00.076055: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Segmentation fault (core dumped)

With cuda-memcheck

$ cuda-memcheck python tools/train_lanenet.py

It produces the following output:

========= CUDA-MEMCHECK
2021-06-02 13:49:22.762973: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
W0602 13:49:22.858016 15263 deprecation.py:40] Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
W0602 13:49:28.682472 15263 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

community/20180907-contrib-sunset.md at master · tensorflow/community · GitHub
GitHub - tensorflow/addons: Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
GitHub - tensorflow/io: Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0602 13:49:30.281486 15263 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.read_file is deprecated. Please use tf.io.read_file instead.

W0602 13:49:30.282983 15263 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-06-02 13:49:56.950472: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-06-02 13:49:56.951233: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x26a59c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-02 13:49:56.951288: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-02 13:49:56.958765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
========= Program hit CUDA_ERROR_UNKNOWN (error 999) due to "unknown error" on CUDA API call to cuDevicePrimaryCtxRetain.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuDevicePrimaryCtxRetain + 0x114) [0x1d235c]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so [0x896dce4]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor3gpu9GpuDriver13CreateContextEiiRKNS_13DeviceOptionsEPPNS0_10GpuContextE + 0x160) [0x88667b8]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor3gpu11GpuExecutor4InitEiNS_13DeviceOptionsE + 0x14c) [0x6fac5c4]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor14StreamExecutor4InitENS_13DeviceOptionsE + 0x78) [0x8941a30]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 (_ZN15stream_executor3gpu12CudaPlatform19GetUncachedExecutorERKNS_20StreamExecutorConfigE + 0x1a8) [0x103f6b0]
2021-06-02 13:49:57.063774: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 [0x103e704]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor13ExecutorCache11GetOrCreateERKNS_20StreamExecutorConfigERKSt8functionIFNS_4port8StatusOrISt10unique_ptrINS_14StreamExecutorESt14default_deleteIS8_EEEEvEE + 0x268) [0x8957820]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 (_ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE + 0x50) [0x103e7d8]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so [0x6e8f9c4]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 (_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi + 0x3ec) [0xbf5b6c]
========= 2021-06-02 13:49:57.064913: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found
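Most of both traces is informational. A small filter can pull out the lines that report the failure. This is a minimal sketch: the function name `key_lines` and the choice of patterns (CUDA error codes, TensorFlow fatal `F` log lines, and the shell's segmentation fault message) are assumptions made for illustration, not part of any tool.

```python
import re

# Assumed patterns for "interesting" lines: CUDA error codes, TensorFlow
# fatal-severity log lines, and the shell's segfault message.
KEY_LINE = re.compile(r"CUDA_ERROR_\w+|: F tensorflow/|Segmentation fault")


def key_lines(log_text):
    """Return the lines of log_text that match KEY_LINE, in original order."""
    return [line for line in log_text.splitlines() if KEY_LINE.search(line)]


# Three lines copied from the cuda-memcheck run above.
sample = """\
2021-06-02 13:49:56.958765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
========= Program hit CUDA_ERROR_UNKNOWN (error 999) due to "unknown error" on CUDA API call to cuDevicePrimaryCtxRetain.
========= 2021-06-02 13:49:57.064913: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
"""

for line in key_lines(sample):
    print(line)
```

Saving the full output to a file and passing its contents to `key_lines` would work the same way.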

System Information:

Python: 3.6
JetPack: 4.5.1
TensorFlow: 1.15.5+nv21.5
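
The same details can be gathered with a short script, which makes it easier to compare the Jetson with the machine where the code runs correctly. This is only a sketch: `collect_env` is a hypothetical helper, and reading `/etc/nv_tegra_release` for the L4T release string is an assumption about where the board stores it.

```python
import platform
from pathlib import Path


def collect_env():
    """Collect basic version information as a dict of strings."""
    info = {
        "python": platform.python_version(),
        "machine": platform.machine(),  # e.g. aarch64 on a Jetson
    }

    # TensorFlow may be missing or may fail to import; record that
    # instead of raising.
    try:
        import tensorflow as tf
        info["tensorflow"] = tf.__version__
    except Exception as exc:
        info["tensorflow"] = "not importable ({})".format(type(exc).__name__)

    # Assumption: on Jetson boards the L4T release string is kept here.
    release = Path("/etc/nv_tegra_release")
    try:
        lines = release.read_text().splitlines()
        info["l4t_release"] = lines[0] if lines else "empty"
    except OSError:
        info["l4t_release"] = "not found"

    return info


for key, value in collect_env().items():
    print("{}: {}".format(key, value))
```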

AastaLLL · June 3, 2021, 2:43am

Hi,

First, could you share the dmesg log with us?
(Reboot → reproduce the error → run dmesg and share the output)

We also want to reproduce this issue in our environment.
Could you share the source and corresponding commands to reproduce this with us?

Thanks.
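
A full dmesg log can be long. The sketch below keeps only the lines that seem related to the GPU driver or to a crashing process. The keyword list is a guess made for illustration (an assumption, not an official list), and `filter_kernel_log` and `read_dmesg` are hypothetical helper names. The complete log is still the better thing to attach to the thread.

```python
import shutil
import subprocess

# Assumed keywords: GPU driver name, fault reports, and out-of-memory kills.
KEYWORDS = ("nvgpu", "fault", "out of memory", "oom-killer")


def filter_kernel_log(text, keywords=KEYWORDS):
    """Return the lines of text that contain any keyword, ignoring case."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        line for line in text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


def read_dmesg():
    """Return the output of dmesg, or an empty string if it is unavailable.

    On some systems dmesg needs root; any error text it prints is returned.
    """
    if shutil.which("dmesg") is None:
        return ""
    result = subprocess.run(
        ["dmesg"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    for line in filter_kernel_log(read_dmesg()):
        print(line)
```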

ancient-ghost · June 3, 2021, 10:07am

loginfo.txt (64.0 KB)

Thanks, I have uploaded the dmesg log.

For the source code you can refer to this link: Codes-for-Lane-Detection/SCNN-Tensorflow at master · cardwing/Codes-for-Lane-Detection · GitHub
I made some modifications to fit the new TensorFlow version and reduced the size of the networks, and the code runs correctly on my own computer. But after I transferred it to the Jetson TX2 platform, the error occurs.

AastaLLL · June 15, 2021, 6:19am

Hi,

Based on the log, the issue occurs when trying to initialize the CUDA library.
However, v1.15.5+nv21.5 should work with JetPack 4.5.1.

Could you test another script to see whether it works?
Thanks.

ancient-ghost · June 15, 2021, 7:23am

Thank you, AastaLLL.

I cannot be sure that my other scripts are correct, so could you provide some TensorFlow sample scripts for the Jetson TX2 to test with? That way, we can clearly identify what the problem is.

Thanks again.

AastaLLL · June 29, 2021, 8:46am

Hi,

Thanks for your patience.

For example, could you test the following?

$ python3
>>> import numpy as np
>>> import tensorflow as tf
>>> D = tf.convert_to_tensor(np.array([[1., 2., 3.], [-3., -7., -1.], [0., 5., -2.]]))
>>> print(tf.linalg.det(D))

Thanks.
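
For comparison, the value this check should produce can be computed without TensorFlow. The sketch below uses a hypothetical helper `det3` that expands the determinant along the first row. Note that TensorFlow 1.15 runs in graph mode by default, so the `print` in the snippet above would likely show a Tensor object rather than a number unless the tensor is evaluated in a session.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Same matrix as in the TensorFlow check.
D = [[1., 2., 3.], [-3., -7., -1.], [0., 5., -2.]]
print(det3(D))  # -38.0
```

A GPU result that matches -38 up to floating-point rounding would suggest that CUDA initialization and a basic linear algebra op work outside the training script.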

system · September 5, 2021, 5:19am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
