Segmentation faults and illegal memory address accesses when running TensorFlow code

When I run gpu-burn on Linux, I keep getting segmentation faults.

XXX:~/gpu-burn$ ./gpu_burn 60
GPU 0: GeForce GTX 1070 (UUID: GPU-8c57e0f7-03ca-bd20-fe5e-b25482e4ed9b)
Segmentation fault
XXX:~/gpu-burn$ Initialized device 0 with 8192 MB of memory (7203 MB available, using 6482 MB of it), using FLOATS

When running TensorFlow code, I keep getting illegal memory address access errors.
See: https://github.com/tensorflow/tensorflow/issues/46247
People on the TensorFlow Discord team say that this is not a TensorFlow issue, since I get similar errors when running non-TensorFlow code such as gpu-burn. NVIDIA support said that my GPU is not faulty. That seems to leave CUDA as the cause of the issue. What can be done?

A segmentation fault and CUDA_ERROR_ILLEGAL_ADDRESS are not the same thing, and their apparent similarity is not really relevant here. One is an error in host code, the other is an error in device code.

Segmentation faults are errors in host code, and they generally do not point to CUDA as the “cause” of the issue.

So the segmentation fault is probably an error in gpu-burn's own code? How can I fix it?

What is an error in device code?

Device code is code running on the GPU. Host code is code running on the CPU.

You should be able to identify the root cause of the segmentation fault by using standard debugging techniques. If you are unfamiliar with standard debugging techniques, consider investing in a few books covering them. There are probably also tutorials on “how to debug software” online, but I am not familiar with those, as I learned debugging by doing it (no books or online tutorials back then).

Does an error in device code mean that my GPU is faulty? Or does it mean CUDA’s code is faulty?

An error in device code most likely means the person who wrote the code made a mistake, such as an out-of-bounds memory access leading to CUDA_ERROR_ILLEGAL_ADDRESS.
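
To make the out-of-bounds case concrete: one common way for device code to go out of bounds is a kernel launched with more threads than there are array elements and no guard on the index. The sketch below only emulates that launch arithmetic in plain host C++. It is not CUDA, and it is not claimed to be what went wrong in the TensorFlow issue. The names count_out_of_bounds_threads, n, and block_size are made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: emulates the index arithmetic of a 1-D kernel launch
// on the host. In a real kernel each thread would compute
//   i = blockIdx.x * blockDim.x + threadIdx.x
// and, without an `if (i < n)` guard, touch element i of an n-element array.
// Assumes block_size is nonzero.
std::size_t count_out_of_bounds_threads(std::size_t n, std::size_t block_size) {
    // Round the block count up so that every element gets a thread.
    std::size_t num_blocks = (n + block_size - 1) / block_size;

    std::size_t out_of_bounds = 0;
    for (std::size_t block = 0; block < num_blocks; ++block) {
        for (std::size_t thread = 0; thread < block_size; ++thread) {
            std::size_t i = block * block_size + thread;
            if (i >= n) {
                // A guarded kernel would return here; an unguarded one would
                // read or write past the end of the array.
                ++out_of_bounds;
            }
        }
    }
    return out_of_bounds;
}
```

For 1000 elements and a block size of 256, four blocks give 1024 threads, so count_out_of_bounds_threads(1000, 256) counts 24 threads whose index lies past the end of the array. On a GPU, an unguarded access from such a thread may be reported as CUDA_ERROR_ILLEGAL_ADDRESS, or it may silently touch neighboring memory.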

A segfault (segmentation fault; Windows: general protection fault) is something that occurs in host code running on the CPU and most likely means the person who wrote that code made a mistake, such as an out-of-bounds memory access.
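
As an illustration of that kind of host-side mistake, here is a small C++ sketch. It is not taken from gpu-burn, and read_element is a hypothetical helper. Indexing a std::vector with operator[] past the end is undefined behavior and may or may not segfault, whereas at() checks the index and throws std::out_of_range, which turns a silent bug into an error you can report.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copies data[index] into `out` and returns true, or
// returns false (leaving `out` untouched) if the index is outside the vector.
// Uses the bounds-checked at() instead of operator[], which performs no check.
bool read_element(const std::vector<float>& data, std::size_t index, float& out) {
    try {
        out = data.at(index);  // throws std::out_of_range when index >= data.size()
        return true;
    } catch (const std::out_of_range&) {
        return false;          // report the bad index instead of reading stray memory
    }
}
```

With a three-element vector, read_element succeeds for indices 0 to 2 and returns false for index 3, where an unchecked data[3] would read past the end of the allocation.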

It is rare, but possible, that out-of-bounds memory accesses are caused by bugs in the toolchain (compiler, linker, etc.). I have never seen out-of-bounds accesses caused by faulty hardware; while that is theoretically possible, it seems highly improbable.