Segmentation fault when training a network

I am getting a segmentation fault while training my neural network.

$ python tools/train_lanenet.py 

The output is as follows:

2021-06-02 13:47:51.200238: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
W0602 13:47:51.293999 15212 deprecation.py:40] Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
W0602 13:47:56.888735 15212 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

W0602 13:47:58.538597 15212 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.read_file is deprecated. Please use tf.io.read_file instead.

W0602 13:47:58.540126 15212 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-06-02 13:48:25.111512: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-06-02 13:48:25.111947: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x13a23150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-02 13:48:25.111996: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-02 13:48:25.119393: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-02 13:48:25.217014: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.217320: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15064ac0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-06-02 13:48:25.217370: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X2, Compute Capability 6.2
2021-06-02 13:48:25.217657: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.217760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3
pciBusID: 0000:00:00.0
2021-06-02 13:48:25.217827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-06-02 13:48:25.221976: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-06-02 13:48:25.224597: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-02 13:48:25.225339: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-02 13:48:25.229724: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-06-02 13:48:25.232906: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-06-02 13:48:25.233694: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-02 13:48:25.233889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.234066: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:25.234137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1794] Adding visible gpu devices: 0
2021-06-02 13:48:25.234214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-06-02 13:48:26.602513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 13:48:26.602598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2021-06-02 13:48:26.602625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
2021-06-02 13:48:26.602919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:26.603123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-06-02 13:48:26.603250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6672 MB memory) → physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
I0602 13:48:29.795616 15212 train_lanenet.py:232] Training from scratch
2021-06-02 13:49:00.076055: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Segmentation fault (core dumped)
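
The log above ends without a Python traceback, so it does not show which call in train_lanenet.py was running when the process crashed. One way to narrow that down, which was not tried in this thread, is Python's standard faulthandler module. A minimal sketch follows; the helper name and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traceback(path="segfault_traceback.log"):
    """Dump the Python traceback of every thread to `path` on a fatal signal.

    The helper name and default file name are illustrative only.
    """
    # The file object must stay open for as long as the process runs,
    # so it is returned to the caller instead of being closed here.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback()
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Calling the helper at the top of the training script, before TensorFlow is imported, should leave a traceback in the file if the segmentation fault recurs. Running `python -X faulthandler tools/train_lanenet.py` enables the same handler without code changes and writes to stderr.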

I also ran the script under cuda-memcheck:

$ cuda-memcheck python tools/train_lanenet.py

It produced the following output:

========= CUDA-MEMCHECK
2021-06-02 13:49:22.762973: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
W0602 13:49:22.858016 15263 deprecation.py:40] Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
W0602 13:49:28.682472 15263 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

W0602 13:49:30.281486 15263 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.read_file is deprecated. Please use tf.io.read_file instead.

W0602 13:49:30.282983 15263 module_wrapper.py:139] From /home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-06-02 13:49:56.950472: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-06-02 13:49:56.951233: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x26a59c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-02 13:49:56.951288: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-02 13:49:56.958765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
========= Program hit CUDA_ERROR_UNKNOWN (error 999) due to "unknown error" on CUDA API call to cuDevicePrimaryCtxRetain.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuDevicePrimaryCtxRetain + 0x114) [0x1d235c]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so [0x896dce4]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor3gpu9GpuDriver13CreateContextEiiRKNS_13DeviceOptionsEPPNS0_10GpuContextE + 0x160) [0x88667b8]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor3gpu11GpuExecutor4InitEiNS_13DeviceOptionsE + 0x14c) [0x6fac5c4]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor14StreamExecutor4InitENS_13DeviceOptionsE + 0x78) [0x8941a30]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 (_ZN15stream_executor3gpu12CudaPlatform19GetUncachedExecutorERKNS_20StreamExecutorConfigE + 0x1a8) [0x103f6b0]
2021-06-02 13:49:57.063774: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 [0x103e704]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so (_ZN15stream_executor13ExecutorCache11GetOrCreateERKNS_20StreamExecutorConfigERKSt8functionIFNS_4port8StatusOrISt10unique_ptrINS_14StreamExecutorESt14default_deleteIS8_EEEEvEE + 0x268) [0x8957820]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 (_ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE + 0x50) [0x103e7d8]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so [0x6e8f9c4]
========= Host Frame:/home/nvidia/lane-det/lane-det-venv/lib/python3.6/site-packages/tensorflow_core/python/…/libtensorflow_framework.so.1 (_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi + 0x3ec) [0xbf5b6c]
========= 2021-06-02 13:49:57.064913: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

System Information:

Python: 3.6
JetPack: 4.5.1
TensorFlow: 1.15.5+nv21.5

Hi,

First, could you share the dmesg log with us?
(Reboot → reproduce the error → run $ dmesg and share the output)
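
The steps above can be sketched in Python as follows. This is only an illustration: the function names and the filter keywords ("nvgpu", "python") are assumptions chosen for the example, not strings known to appear in this particular log, and dmesg may require elevated privileges on some systems.

```python
import subprocess


def run_dmesg():
    """Return the kernel log as a list of lines by invoking `dmesg`."""
    result = subprocess.run(
        ["dmesg"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode; works on Python 3.6
        check=True,
    )
    return result.stdout.splitlines()


def filter_lines(lines, keywords=("nvgpu", "python")):
    """Keep lines containing any keyword (case-insensitive).

    The default keywords are hypothetical; adjust them to the messages
    you are looking for.
    """
    lowered = [k.lower() for k in keywords]
    return [ln for ln in lines if any(k in ln.lower() for k in lowered)]


# Typical use after reproducing the crash:
#     lines = run_dmesg()
#     with open("dmesg_full.txt", "w") as fh:
#         fh.write("\n".join(lines) + "\n")
#     print("\n".join(filter_lines(lines)))
```

Saving the full log first keeps the complete output available for sharing, and the filter is only a convenience for a first look.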

We also want to reproduce this issue in our environment.
Could you share the source and corresponding commands to reproduce this with us?

Thanks.

loginfo.txt (64.0 KB)

Thanks, I have uploaded the dmesg log.

For the source code, please refer to this link: Codes-for-Lane-Detection/SCNN-Tensorflow at master · cardwing/Codes-for-Lane-Detection · GitHub
I made some modifications to fit the newer TensorFlow version and reduced the size of the networks, and the code runs correctly on my own computer. After I transferred it to the Jetson TX2 platform, however, the error occurred.

Hi,

Based on the log, the issue occurs when trying to initialize the CUDA library.
However, v1.15.5+nv21.5 should work with JetPack 4.5.1.

Could you test another script to see whether it works?
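
One possible test script, offered as a sketch rather than an official NVIDIA sample, calls the CUDA driver directly through ctypes, bypassing TensorFlow. It exercises cuDevicePrimaryCtxRetain, the call that failed in the cuda-memcheck trace. The function name probe_cuda_context is hypothetical, and the library name libcuda.so.1 is taken from the log above.

```python
import ctypes


def probe_cuda_context(libname="libcuda.so.1", ordinal=0):
    """Call cuInit, cuDeviceGet and cuDevicePrimaryCtxRetain in order.

    Returns a list of (step, status) pairs. A status of 0 is CUDA_SUCCESS;
    any other integer is a CUresult error code (999 is CUDA_ERROR_UNKNOWN).
    """
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        # The driver library could not be loaded at all.
        return [("dlopen", "failed")]

    results = []

    status = cuda.cuInit(ctypes.c_uint(0))
    results.append(("cuInit", status))
    if status != 0:
        return results

    device = ctypes.c_int()  # CUdevice is a plain int handle
    status = cuda.cuDeviceGet(ctypes.byref(device), ctypes.c_int(ordinal))
    results.append(("cuDeviceGet", status))
    if status != 0:
        return results

    context = ctypes.c_void_p()  # CUcontext is an opaque pointer
    status = cuda.cuDevicePrimaryCtxRetain(ctypes.byref(context), device)
    results.append(("cuDevicePrimaryCtxRetain", status))
    if status == 0:
        # Balance the retain so the primary context is not leaked.
        cuda.cuDevicePrimaryCtxRelease(device)
    return results


if __name__ == "__main__":
    for step, status in probe_cuda_context():
        print(step, status)
```

If every step reports 0 when run normally, basic context creation works outside TensorFlow, which would point toward the framework or the model rather than the driver.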
Thanks.

Thank you, AastaLLL.

I cannot be sure that my other scripts are entirely correct, so could you provide some sample TensorFlow examples for the Jetson TX2 that I can test? That way, we can see clearly what the problem is.

Thanks again.

Hi,

Thanks for your patience.

For example, could you test the following?

$ python3
>>> import numpy as np
>>> import tensorflow as tf
>>> D = tf.convert_to_tensor(np.array([[1., 2., 3.], [-3., -7., -1.], [0., 5., -2.]]))
>>> print(tf.linalg.det(D))
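
Note that in TensorFlow 1.15 eager execution is off by default, so the print call above shows a symbolic Tensor rather than a number, and the op only runs once it is evaluated in a session. The sketch below is a variant of the same check, not part of the original reply. It evaluates the determinant inside a session and compares it with a pure-Python reference value. The helper names det3 and tf_det are hypothetical.

```python
MATRIX = [[1., 2., 3.], [-3., -7., -1.], [0., 5., -2.]]


def det3(m):
    """Reference determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def tf_det(m):
    """Evaluate tf.linalg.det inside a session so the op really executes."""
    # Imported lazily so det3 can be used without TensorFlow installed.
    import tensorflow as tf

    with tf.compat.v1.Session() as sess:
        matrix = tf.constant(m, dtype=tf.float64)
        return float(sess.run(tf.linalg.det(matrix)))


if __name__ == "__main__":
    print("expected:", det3(MATRIX))  # -38.0
    try:
        print("tensorflow:", tf_det(MATRIX))
    except ImportError:
        print("tensorflow is not installed in this environment")
```

If the TensorFlow value matches the reference to within floating-point rounding and the process does not crash, session creation and a simple op work on the device.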

Thanks.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.