Segmentation faults and illegal memory address accesses when running TensorFlow code

urkchar · February 9, 2021, 12:19pm

When testing gpu-burn on a Linux kernel, I keep getting segmentation faults.

XXX:~/gpu-burn$ ./gpu_burn 60
GPU 0: GeForce GTX 1070 (UUID: GPU-8c57e0f7-03ca-bd20-fe5e-b25482e4ed9b)
Segmentation fault
XXX:~/gpu-burn$ Initialized device 0 with 8192 MB of memory (7203 MB available, using 6482 MB of it), using FLOATS

When running TensorFlow code, I keep getting illegal memory address access errors.
See: https://github.com/tensorflow/tensorflow/issues/46247
People on the TensorFlow Discord team say that this is not a TensorFlow issue, since I get similar errors when running non-TensorFlow code such as gpu-burn. NVIDIA support said that my GPU was not faulty. This leaves CUDA as the cause of the issue. What can be done?

Robert_Crovella · February 9, 2021, 6:46pm

A segmentation fault and CUDA_ERROR_ILLEGAL_ADDRESS are not the same thing, and their similarity is not really relevant here. One is an error in host code, the other is an error in device code.

Segmentation faults are errors in host code and generally do not indicate CUDA as the “cause” of the issue.

urkchar · February 10, 2021, 9:47am

So the segmentation fault is believed to be an error in the code of gpu-burn? How can I fix the error?

What is an error in device code?

njuffa · February 11, 2021, 7:09pm

Device code is code running on the GPU. Host code is code running on the CPU.

You should be able to identify the root cause of the segmentation fault by using standard debugging techniques. If you are unfamiliar with standard debugging techniques, consider investing in a few books covering them. There probably are also tutorials on “how to debug software” online, but I am not familiar with those, as I learned debugging by doing it (no books or online tutorials back then).

urkchar · February 11, 2021, 7:14pm

Does an error in device code mean that my GPU is faulty? Or does it mean CUDA’s code is faulty?

njuffa · February 11, 2021, 7:20pm

An error in device code most likely means the person who wrote the code made a mistake, such as an out-of-bounds memory access leading to CUDA_ERROR_ILLEGAL_ADDRESS.

A segfault (segmentation fault; Windows: general protection fault) is something that occurs in host code running on the CPU and most likely means the person who wrote that code made a mistake, such as an out-of-bounds memory access.

It is rare, but possible, that out-of-bounds memory accesses are caused by bugs in the toolchain (compiler, linker, etc.). I have never seen out-of-bounds accesses caused by faulty hardware, and while that is theoretically possible, it seems highly improbable.
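
To make the host/device distinction above concrete, here is a minimal sketch in CUDA. It is not taken from gpu-burn or TensorFlow; the kernel name `fill`, the array size, and the block size are made up for illustration. It shows a kernel (device code) with the bounds guard whose absence is a common source of out-of-bounds accesses, and host code that checks the error codes returned by the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: runs on the GPU. Each thread writes at most one element.
// (Hypothetical kernel, for illustration only.)
__global__ void fill(float *data, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Without this guard, threads with i >= n would write past the end of
    // the allocation. That kind of bug can surface as an illegal address
    // error, although an out-of-bounds access is not guaranteed to be caught.
    if (i < n) {
        data[i] = value;
    }
}

// Host code: runs on the CPU.
int main()
{
    const int n = 1000;  // deliberately not a multiple of the block size
    float *d_data = nullptr;

    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // 4 blocks, 1024 threads > n
    fill<<<blocks, threads>>>(d_data, n, 1.0f);

    // Errors in the launch itself (e.g. a bad configuration) show up here.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    // Errors raised while the kernel executes are reported when the host
    // next synchronizes with the device.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    printf("kernel completed without error\n");
    cudaFree(d_data);
    return 0;
}
```

Assuming the file is saved as `fill.cu`, it can be built with `nvcc fill.cu -o fill`. If a program does report an illegal address, running it under `compute-sanitizer` (or `cuda-memcheck` on older toolkits) is one way to locate the offending access in device code. A segfault in host code is instead investigated with a host debugger such as gdb.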
