Hello,
I am packaging and deploying TensorRT/CUDA C++ applications in a custom way. I am bundling all of the required shared libraries together.
This is generally working and I can load binaries that depend on libcuda and libnvinfer8, etc.
However, I ran into an issue where in some cases my application was silently failing. By that I mean it was producing garbage output when running an inference, yet it was not throwing any errors anywhere.
This is using the libraries packaged with L4T 32.7.1.
I dug really deep using LD_DEBUG=libs and strace, and I have come to the following conclusion:
libnvinfer.so.8 tries to load libcuda.so at some point after libcuda.so.1 has already been loaded. The output of readelf -d libnvinfer.so.8 does not list libcuda as NEEDED, which leads me to believe it’s being loaded manually via dlopen or an equivalent.
# readelf -d libnvinfer.so.8
Dynamic section at offset 0x9abb340 contains 39 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libnvdla_compiler.so]
0x0000000000000001 (NEEDED) Shared library: [libEGL.so.1]
0x0000000000000001 (NEEDED) Shared library: [libnvmedia.so]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-aarch64.so.1]
0x000000000000000e (SONAME) Library soname: [libnvinfer.so.8]
0x0000000000000010 (SYMBOLIC) 0x0
0x000000000000001d (RUNPATH) Library runpath: [$ORIGIN]
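To double-check from inside the process which libcuda files actually end up mapped (rather than inferring it from strace), something like the following could be called after the first inference. This is only a minimal sketch that assumes glibc's dl_iterate_phdr is available; the helper name loaded_objects_matching is made up for illustration and is not part of TensorRT or CUDA.

```cpp
#include <link.h>    // dl_iterate_phdr, struct dl_phdr_info (glibc)
#include <cstddef>
#include <string>
#include <vector>

namespace {

// State passed through the C callback: what to look for and what was found.
struct MatchState {
    std::string needle;
    std::vector<std::string> hits;
};

// Called once per loaded object; records names that contain the needle.
int collect_matching(struct dl_phdr_info* info, std::size_t /*size*/, void* data) {
    auto* state = static_cast<MatchState*>(data);
    const std::string name = info->dlpi_name ? info->dlpi_name : "";
    // The main executable is reported with an empty name, so skip it.
    if (!name.empty() && name.find(state->needle) != std::string::npos) {
        state->hits.push_back(name);
    }
    return 0;  // returning 0 continues the iteration
}

}  // namespace

// Hypothetical helper: returns the path of every currently loaded shared
// object whose name contains `needle`, e.g. "libcuda".
std::vector<std::string> loaded_objects_matching(const std::string& needle) {
    MatchState state{needle, {}};
    dl_iterate_phdr(collect_matching, &state);
    return state.hits;
}
```

Calling loaded_objects_matching("libcuda") and printing the result should show either a single path (one instance) or two different paths, which would correspond to the failing case described here.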
Things work when libnvinfer.so.8 loads a libcuda.so symlink that points to the exact same libcuda.so.1 that ld loaded previously. However, based on my strace output, the search path for libcuda.so doesn’t match what I would expect from ld and dlopen.
If libnvinfer.so.8 finds a different libcuda.so (even if it is the exact same file, just a different instance), it won’t report any errors; it will just produce garbage output. Moreover, if it can’t find any libcuda.so to open, it also fails silently with garbage output.
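The "same instance or not" distinction can be tested directly with dlopen. The sketch below assumes glibc behaviour, where dlopen hands back the same handle when two names resolve to an object that is already loaded. The function name resolves_to_same_instance is hypothetical, and the call has the side effect of loading both libraries if they are not loaded yet.

```cpp
#include <dlfcn.h>   // dlopen, dlclose
#include <string>

// Hypothetical helper: returns true only if both names can be opened and the
// dynamic loader resolves them to the same loaded object (same handle).
// Returns false if either name fails to load or if two instances result.
bool resolves_to_same_instance(const std::string& a, const std::string& b) {
    void* handle_a = dlopen(a.c_str(), RTLD_LAZY | RTLD_LOCAL);
    void* handle_b = dlopen(b.c_str(), RTLD_LAZY | RTLD_LOCAL);

    const bool same = (handle_a != nullptr) && (handle_a == handle_b);

    // Drop the references taken above; objects that were already loaded
    // before this call stay loaded.
    if (handle_b != nullptr) dlclose(handle_b);
    if (handle_a != nullptr) dlclose(handle_a);
    return same;
}
```

In my setup I would expect a call with the bundled libcuda.so.1 and the libcuda.so symlink next to it to return true, and a call with a separate copy elsewhere on disk to return false. On older glibc versions this needs to be linked with -ldl.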
Now that the problem is isolated, my question is: How can I influence the search paths that libnvinfer.so.8 is using when trying to find libcuda.so?
After first attempting to open libcuda.so in the same directory as libnvinfer.so.8, I observed the following openat calls:
openat(AT_FDCWD, "/lib/aarch64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/tls/aarch64/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/tls/aarch64/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/tls/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/tls/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/aarch64/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/aarch64/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/tls/aarch64/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/tls/aarch64/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/tls/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/tls/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/aarch64/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/aarch64/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/tls/aarch64/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/tls/aarch64/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/tls/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/tls/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/aarch64/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/atomics/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
This does not seem to be affected by ld.so.cache or LD_LIBRARY_PATH, which leads me to believe it’s not using standard mechanisms for searching.
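For comparison with the openat sequence above, glibc can report the directory list it associates with an already loaded object through dlinfo and RTLD_DI_SERINFO. The sketch below is an assumption-laden diagnostic, not a fix: the helper name search_dirs_for is made up, it relies on glibc extensions, and I have not confirmed that the list it returns is the one libnvinfer.so.8 ends up using internally.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for dlinfo and Dl_serinfo (glibc extensions)
#endif
#include <dlfcn.h>   // dlopen, dlclose, dlinfo, RTLD_DI_SERINFO*
#include <cstdlib>   // std::malloc, std::free
#include <string>
#include <vector>

// Hypothetical helper: returns the library search directories that glibc
// reports for the object named `lib`, or an empty list on any failure.
std::vector<std::string> search_dirs_for(const std::string& lib) {
    std::vector<std::string> dirs;
    void* handle = dlopen(lib.c_str(), RTLD_LAZY);
    if (handle == nullptr) return dirs;

    // First query only the size of the variable-length result buffer.
    Dl_serinfo size_probe;
    if (dlinfo(handle, RTLD_DI_SERINFOSIZE, &size_probe) == 0) {
        auto* info = static_cast<Dl_serinfo*>(std::malloc(size_probe.dls_size));
        // The buffer must be initialised with the size query before the
        // actual RTLD_DI_SERINFO request fills in the paths.
        if (info != nullptr &&
            dlinfo(handle, RTLD_DI_SERINFOSIZE, info) == 0 &&
            dlinfo(handle, RTLD_DI_SERINFO, info) == 0) {
            for (unsigned int i = 0; i < info->dls_cnt; ++i) {
                dirs.emplace_back(info->dls_serpath[i].dls_name);
            }
        }
        std::free(info);
    }
    dlclose(handle);
    return dirs;
}
```

Printing search_dirs_for("libnvinfer.so.8") from the application and comparing it with the directories in the strace output would show whether the lookups follow the loader's own list for that object or something else.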