BUG: libnvptxcompiler_static.a segfaults when linked with LLD on x86_64

libnvptxcompiler_static.a crashes with a segmentation fault in pthread_mutex_lock() when linked with LLVM’s LLD linker on x86_64. The crash occurs because a mutex pointer is NULL (the faulting address 0x10 is a small field offset from a NULL pointer). The same code works correctly when linked with GNU ld, and works on aarch64 with LLD.

All tested CUDA versions on x86_64 exhibit this issue:

- CUDA 12.6.85

- CUDA 12.8.93

- CUDA 13.0.88

- CUDA 13.1.115

Works correctly on aarch64:

- CUDA 13.1.115

We tested 8 different environment configurations on x86_64 (all targeting sm_86):

| OS | CUDA Source | CUDA Version | Clang | LLD | Result |
|----|-------------|--------------|-------|-----|--------|
| Fedora 42 | Fedora 42 repos | 13.1 | 20.1.8 | 20.1.8 | SEGFAULT |
| Fedora 42 | RHEL10 repos | 13.1 | 20.1.8 | 20.1.8 | SEGFAULT |
| Fedora 43 | Fedora 42 repos | 13.1 | 21.1.8 | 21.1.8 | SEGFAULT |
| Fedora 43 | RHEL10 repos | 13.1 | 21.1.8 | 21.1.8 | SEGFAULT |
| Fedora 43 | RHEL10 repos | 13.0 | 21.1.8 | 21.1.8 | SEGFAULT |
| Fedora 43 | RHEL9 repos | 12.8 | 21.1.8 | 21.1.8 | SEGFAULT |
| Fedora 43 | RHEL9 repos | 12.6 | 21.1.8 | 21.1.8 | SEGFAULT |
| Fedora Rawhide | RHEL10 repos | 13.1 | 21.1.8 | 21.1.8 | SEGFAULT |

When using GNU ld instead, all configurations work.

On aarch64 (Apple Silicon) things work as expected:

- Fedora 43 + CUDA 13.1 + Clang 21.1.8 + LLD 21.1.8

Environment

x86_64 (affected):

- CPU: Intel Tigerlake (tested in Docker on Linux 6.18.7)

- OS: Fedora 42/43/Rawhide (tested across multiple variants)

- Toolchain: Clang 20.1.8 - 21.1.8, LLD (same versions), GCC 15.2.1

- GNU ld: 2.45.1

aarch64 (not affected):

- CPU: Apple Silicon M-series

- OS: Fedora 43 (Linux 6.10.14-linuxkit)

- Same Clang/LLD versions

Reproduction:

```cpp
#include <cstring>
#include <cstdio>
#include <nvPTXCompiler.h>

int main() {
    const char *ptx_code = R"(
        .version 7.0
        .target sm_86
        .address_size 64
        .visible .entry dummy_kernel() { ret; }
    )";

    nvPTXCompilerHandle compiler;
    nvPTXCompilerCreate(&compiler, strlen(ptx_code), ptx_code);

    const char *options[] = {"--gpu-name=sm_86"};
    nvPTXCompilerCompile(compiler, 1, options);  // CRASH HERE

    nvPTXCompilerDestroy(&compiler);
    return 0;
}
```

Crashes (LLD on x86_64):

```sh
clang++ -fuse-ld=lld -o test main.cpp \
    -I/usr/local/cuda/include \
    /usr/local/cuda/lib64/libnvptxcompiler_static.a -ldl -lpthread
./test  # Segmentation fault
```

Works (GNU ld on x86_64):

```sh
clang++ -o test main.cpp \
    -I/usr/local/cuda/include \
    /usr/local/cuda/lib64/libnvptxcompiler_static.a -ldl -lpthread
./test  # Success
```

Works (LLD on aarch64):

```sh
clang++ -fuse-ld=lld -o test main.cpp \
    -I/usr/local/cuda/include \
    /usr/local/cuda/lib64/libnvptxcompiler_static.a -ldl -lpthread
./test  # Success
```

Stack Trace (UBSan):

```
==697==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000010
==697==The signal is caused by a READ memory access.
==697==Hint: address points to the zero page.
#0 pthread_mutex_lock@@GLIBC_2.2.5 (/lib64/libc.so.6)
#1 libnvptxcompiler_static_10d97869a92ae1171b368b8b13994d673e6bb182
#2 __cuda_CallJitEntryPoint
#3 nvPTXCompilerCompile
#4 main
```

Register `rdi` = `0x0000000000000000` (NULL mutex pointer)
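
The fault address is consistent with glibc's mutex layout: on x86_64, `pthread_mutex_lock()` starts by reading the mutex "kind" field, which lives at offset 0x10 inside `pthread_mutex_t`, so calling it with a NULL mutex pointer faults at exactly address 0x10. Here is a minimal sketch of that failure mode (an illustration only, not the library's actual code):

```cpp
#include <pthread.h>

int main() {
    // Stands in for the library's global mutex whose initializer never ran:
    // the pointer handed to pthread_mutex_lock() ends up being NULL.
    pthread_mutex_t *never_initialized = nullptr;

    // glibc reads the mutex type at offset 0x10 of the (NULL) mutex,
    // producing a read fault at address 0x10, as in the UBSan report above.
    pthread_mutex_lock(never_initialized);
    return 0;
}
```

Under a debugger, this toy program should show the same pattern: `rdi` (the first argument) equal to 0 and a read fault at 0x10.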

To summarize: the same code works on aarch64 but fails on x86_64 with an identical toolchain; it fails only with LLD and works with GNU ld; and the failure is independent of the CUDA and Clang/LLD versions tested.

This prevents use of `libnvptxcompiler_static.a` in any project using the LLVM toolchain with LLD on x86_64, as well as in Rust projects (Rust uses LLD by default).

The workaround is to force GNU ld:

```sh
clang++ -fuse-ld=ld …  # Use GNU ld
# or simply omit the -fuse-ld flag to use the default linker
```

This looks related to an old, known issue: some CUDA static libraries still use `.ctors` and `.dtors` sections instead of `.init_array` and `.fini_array`, respectively.

As a result, the resulting executable ends up containing both types of sections, and glibc's startup code silently ignores `.ctors` and `.dtors`, so the library's global mutexes are never initialized. That is why we get this odd failure: an attempt to call `pthread_mutex_lock()` on a NULL pointer.

`ld` (`ld.bfd`) and `ld.gold` by default merge and rename `.ctors`/`.dtors` into the modern `.init_array`/`.fini_array` sections, but `ld.lld` does not, since `.ctors`/`.dtors` have been considered deprecated for decades.
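
Below is a minimal standalone sketch of that difference (my own illustration, not NVIDIA's code; the file name `ctors_demo.cpp` is made up). It plants one constructor pointer directly in the legacy `.ctors` section, roughly what the CUDA library's objects contain, plus one ordinary constructor that the compiler emits into `.init_array`. Building with `clang++ -fuse-ld=bfd ctors_demo.cpp -o demo` should print both messages, because GNU ld's default linker script folds `.ctors` into `.init_array`; building with `clang++ -fuse-ld=lld ctors_demo.cpp -o demo` is expected to print only the `.init_array` one, because nothing in a modern startup sequence walks the orphaned `.ctors` section.

```cpp
// ctors_demo.cpp (hypothetical file name, illustration only)
#include <cstdio>

static void legacy_init() { std::puts("legacy .ctors constructor ran"); }

// Emulate an old object file: place a constructor pointer directly into the
// legacy .ctors section, similar to what libnvptxcompiler_static.a's objects ship.
__attribute__((used, section(".ctors")))
static void (*legacy_entry)() = legacy_init;

// An ordinary global constructor: modern compilers emit this into .init_array.
__attribute__((constructor))
static void modern_init() { std::puts(".init_array constructor ran"); }

int main() { return 0; }
```

Whether a given binary or archive still carries the legacy sections can be checked with `llvm-readelf -S` (or `objdump -h`).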

This appears to have been fixed for most CUDA binaries around CUDA 11; `libcudart_static`, for example, is no longer affected. The aarch64 static libraries are not affected either.

People apparently first discovered and reported this bug to LLVM in 2016, but because the issue is so specific to CUDA on x86_64, the associated patch was never merged.

As a workaround, one could simply patch the affected sections in place, e.g.:

```sh
llvm-objcopy --rename-section .ctors=.init_array --rename-section .dtors=.fini_array \
    /usr/local/cuda-13.1/targets/x86_64-linux/lib/libnvptxcompiler_static.a
```

References:

1. llvm/llvm-project issue #30572: "lld produces broken executable with CUDA"

2. .ctors sections in static libraries