Failed to Build Codebase in a Singularity Container

Ziqi · November 10, 2021, 3:39pm

I have a codebase that builds successfully on SUSE Linux with CUDA 11.4. The same codebase fails to build when it is bound into a container (image built with Singularity) that uses a different Linux distribution (Ubuntu) and a different CUDA version (11.2). Building the codebase in the container produces the following output:

/usr/bin/g++ -lpthread -lrt -ldl -L/usr/local/cuda/lib64 -lcudart -lcuda /SMAQ/bin/release/smaq_client.cpp.o /SMAQ/bin/release/common.cpp.o /SMAQ/bin/release/algo_shared_mem.cpp.o -o /SMAQ/smaq_client
/usr/bin/ld: /SMAQ/bin/release/common.cpp.o: in function `makePinnedData(int, int, bool)':
common.cpp:(.text+0x173): undefined reference to `cudaMallocHost'
/usr/bin/ld: common.cpp:(.text+0x186): undefined reference to `cudaGetErrorString'
/usr/bin/ld: /SMAQ/bin/release/common.cpp.o: in function `makePinnedData_N(int, int, int, bool)':
common.cpp:(.text+0x39a): undefined reference to `cudaMallocHost'
/usr/bin/ld: common.cpp:(.text+0x3ad): undefined reference to `cudaGetErrorString'
/usr/bin/ld: /SMAQ/bin/release/common.cpp.o: in function `freePinnedData(short**)':
common.cpp:(.text+0x4a5): undefined reference to `cudaFreeHost'
/usr/bin/ld: common.cpp:(.text+0x4b8): undefined reference to `cudaGetErrorString'
/usr/bin/ld: /SMAQ/bin/release/algo_shared_mem.cpp.o: in function `openSharedFile(char const*, int)':
algo_shared_mem.cpp:(.text+0x147): undefined reference to `shm_open'
/usr/bin/ld: /SMAQ/bin/release/algo_shared_mem.cpp.o: in function `openExstSharedFile(char const*, int)':
algo_shared_mem.cpp:(.text+0x260): undefined reference to `shm_open'
collect2: error: ld returned 1 exit status
make: *** [makefile:129: /SMAQ/smaq_client] Error 1

As I understand it, the error occurs at the linking stage: the linker cannot find the definitions of functions such as “cudaMallocHost”, “cudaFreeHost”, “cudaGetErrorString”, and “shm_open”. My understanding is that the linker searches for library dependencies using LD_LIBRARY_PATH, and that the undefined references come from the CUDA runtime library (libcudart) and the POSIX realtime library (librt). When the container image was built, the paths of these libraries were indeed not included in LD_LIBRARY_PATH. So, within the container, I updated LD_LIBRARY_PATH using “export” and verified the result with “echo $LD_LIBRARY_PATH”:

/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib/python3.8/dist-packages/tensorflow:/usr/lib64:/.singularity.d/libs:/usr/local/cuda-11.2/targets/x86_64-linux/lib:/usr/lib/x86_64-linux-gnu

However, the build error persists. Can someone help me analyze the problem?

Robert_Crovella · November 10, 2021, 3:55pm

That’s not correct for the kind of linking you are referring to. LD_LIBRARY_PATH is an instruction to the runtime dynamic loader (also called the runtime dynamic linker), which locates shared libraries when a program starts. It is not what the build-time linker uses to resolve -l options. Please google that if you want more detail on what it does and what that env var is used for.

Therefore, to fix your linking problem, you may have to revisit the command that failed:

/usr/bin/g++ -lpthread -lrt -ldl -L/usr/local/cuda/lib64 -lcudart -lcuda /SMAQ/bin/release/smaq_client.cpp.o /SMAQ/bin/release/common.cpp.o /SMAQ/bin/release/algo_shared_mem.cpp.o -o /SMAQ/smaq_client

As an example, cudaMallocHost is part of the CUDA runtime library, and that library is specified with this switch:

-lcudart

The locations the linker will search for that library include:

-L/usr/local/cuda/lib64

Therefore I would normally surmise that in your container or container build process, that library (i.e. libcudart.so) is not available at /usr/local/cuda/lib64.

However, in that case, ld would issue a message like cannot find -lcudart or similar. So something else is going on. It’s finding the library, but not resolving the symbols.

You might have a corrupted environment where, for example, you had previously built objects like common.cpp.o and are injecting those into your container, rather than re-compiling everything. Just a guess. Another possible case where I have seen this kind of strangeness is when mixing C and C++ style linking. Doesn’t seem likely here.
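
To make the distinction above concrete, here is a minimal Python sketch of how the -L and -l switches on the failing command translate into the files the linker looks for. It is an illustration only, not how ld is implemented: the function name link_search_candidates is made up for this example, the toolchain's built-in default directories (which are searched after the -L directories) are left out, and only the attached forms -Ldir and -lname are handled. Note that LD_LIBRARY_PATH plays no part in it.

```python
import shlex


def link_search_candidates(command):
    """Map each -l<name> in a link command to the files ld would look for.

    Hypothetical helper for illustration. Default system directories are
    deliberately omitted, and LD_LIBRARY_PATH is not consulted at all.
    """
    tokens = shlex.split(command)
    # -L directories apply to every -l option, wherever they appear.
    dirs = [t[2:] for t in tokens if t.startswith("-L") and len(t) > 2]
    names = [t[2:] for t in tokens if t.startswith("-l") and len(t) > 2]
    candidates = {}
    for name in names:
        paths = []
        for d in dirs:
            # Within one directory the shared object is tried before the archive.
            paths.append(f"{d}/lib{name}.so")
            paths.append(f"{d}/lib{name}.a")
        candidates[name] = paths
    return candidates


cmd = (
    "/usr/bin/g++ -lpthread -lrt -ldl -L/usr/local/cuda/lib64 -lcudart -lcuda "
    "/SMAQ/bin/release/smaq_client.cpp.o /SMAQ/bin/release/common.cpp.o "
    "/SMAQ/bin/release/algo_shared_mem.cpp.o -o /SMAQ/smaq_client"
)

for name, paths in link_search_candidates(cmd).items():
    print(name, paths)
```

For cudart this lists /usr/local/cuda/lib64/libcudart.so first, which is why checking that directory inside the container is the natural next step.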

Ziqi · November 10, 2021, 4:43pm

Thanks for the reminder about what “LD_LIBRARY_PATH” actually does. Yes, I made a mistake: the variable is only consulted during runtime dynamic loading, not at the linking stage.

I checked the directory “/usr/local/cuda/lib64” by executing “ls -la /usr/local/cuda/lib64/ | grep libcudart” WITHIN the container, which gives me the following information:

lrwxrwxrwx 1 root root        17 Feb  4  2021 libcudart.so -> libcudart.so.11.0
lrwxrwxrwx 1 root root        21 Feb  4  2021 libcudart.so.11.0 -> libcudart.so.11.2.146
-rw-r--r-- 1 root root    582008 Feb  4  2021 libcudart.so.11.2.146
-rw-r--r-- 1 root root    906670 Feb  4  2021 libcudart_static.a

Judging from the above, there seems to be no problem with the library. To rule out stale object files, I removed all of them. Then, executing “make REL=1” in my codebase in the container gave the following output (the same as before, plus the compilation commands that I had omitted earlier because I thought they were irrelevant):

/usr/bin/g++ -I/SMAQ/include -I/usr/local/cuda/include -MM -MF .dep/smaq_client.cpp.d -MT /SMAQ/bin/release/smaq_client.cpp.o /SMAQ/src/smaq_client.cpp
/usr/bin/g++  --std=c++11 -I/SMAQ/include -I/usr/local/cuda/include -c -fPIC /SMAQ/src/smaq_client.cpp -o /SMAQ/bin/release/smaq_client.cpp.o
/usr/bin/g++ -I/SMAQ/include -I/usr/local/cuda/include -MM -MF .dep/common.cpp.d -MT /SMAQ/bin/release/common.cpp.o /SMAQ/src/common.cpp
/usr/bin/g++  --std=c++11 -I/SMAQ/include -I/usr/local/cuda/include -c -fPIC /SMAQ/src/common.cpp -o /SMAQ/bin/release/common.cpp.o
/usr/bin/g++ -I/SMAQ/include -I/usr/local/cuda/include -MM -MF .dep/algo_shared_mem.cpp.d -MT /SMAQ/bin/release/algo_shared_mem.cpp.o /SMAQ/src/algo_shared_mem.cpp
/usr/bin/g++  --std=c++11 -I/SMAQ/include -I/usr/local/cuda/include -c -fPIC /SMAQ/src/algo_shared_mem.cpp -o /SMAQ/bin/release/algo_shared_mem.cpp.o
/usr/bin/g++ -lpthread -lrt -ldl -L/usr/local/cuda/lib64 -lcudart -lcuda /SMAQ/bin/release/smaq_client.cpp.o /SMAQ/bin/release/common.cpp.o /SMAQ/bin/release/algo_shared_mem.cpp.o -o /SMAQ/smaq_client
/usr/bin/ld: /SMAQ/bin/release/common.cpp.o: in function `makePinnedData(int, int, bool)':
common.cpp:(.text+0x173): undefined reference to `cudaMallocHost'
/usr/bin/ld: common.cpp:(.text+0x186): undefined reference to `cudaGetErrorString'
/usr/bin/ld: /SMAQ/bin/release/common.cpp.o: in function `makePinnedData_N(int, int, int, bool)':
common.cpp:(.text+0x39a): undefined reference to `cudaMallocHost'
/usr/bin/ld: common.cpp:(.text+0x3ad): undefined reference to `cudaGetErrorString'
/usr/bin/ld: /SMAQ/bin/release/common.cpp.o: in function `freePinnedData(short**)':
common.cpp:(.text+0x4a5): undefined reference to `cudaFreeHost'
/usr/bin/ld: common.cpp:(.text+0x4b8): undefined reference to `cudaGetErrorString'
/usr/bin/ld: /SMAQ/bin/release/algo_shared_mem.cpp.o: in function `openSharedFile(char const*, int)':
algo_shared_mem.cpp:(.text+0x147): undefined reference to `shm_open'
/usr/bin/ld: /SMAQ/bin/release/algo_shared_mem.cpp.o: in function `openExstSharedFile(char const*, int)':
algo_shared_mem.cpp:(.text+0x260): undefined reference to `shm_open'
collect2: error: ld returned 1 exit status
make: *** [makefile:129: /SMAQ/smaq_client] Error 1

It seems to me that the problem is still at the linking stage, even though the library is right there.

Robert_Crovella · November 10, 2021, 4:58pm

Another problem I have seen is that different versions of g++/ld (i.e. the GNU toolchain) have different sensitivities to expression of link order dependencies on the linking command line.

The general rule or best practice is that moving from left to right on the linking command line, dependencies on the left should be satisfied by providers on the right. You’ve broken that in your link command:

/usr/bin/g++ -lpthread -lrt -ldl -L/usr/local/cuda/lib64 -lcudart -lcuda /SMAQ/bin/release/smaq_client.cpp.o /SMAQ/bin/release/common.cpp.o /SMAQ/bin/release/algo_shared_mem.cpp.o -o /SMAQ/smaq_client

Notice that common.cpp.o, which clearly has a dependency on libcudart, does not “see” libcudart to the right of it on the link command line.

You could try fixing that by rearranging your Makefile to produce a command like this:

/usr/bin/g++ /SMAQ/bin/release/smaq_client.cpp.o /SMAQ/bin/release/common.cpp.o /SMAQ/bin/release/algo_shared_mem.cpp.o -o /SMAQ/smaq_client -L/usr/local/cuda/lib64 -lcudart -lcuda -lpthread -lrt -ldl

Note that to a first order approximation, this has nothing to do with CUDA, and is a function of the GNU toolchain(s) you are using/comparing. Not all versions of the GNU toolchain behave the same with respect to “enforcement” of this idea, so I don’t know if it applies here, or not.
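
The left-to-right rule above can be illustrated with a small Python model. This is a simplified sketch, not the real ld algorithm: it assumes a library is pulled in only if something already seen on the command line needs one of its symbols, which is how static archives behave and how shared libraries behave when the linker runs with --as-needed. Whether the Ubuntu toolchain in the container used that setting is an assumption that this thread does not confirm. The function name and the symbol tables are made up for illustration, with the symbol names taken from the error output.

```python
def unresolved_symbols(inputs):
    """Walk link inputs left to right and return the symbols left undefined.

    Each input is a (kind, defines, needs) tuple, where kind is "object" or
    "library". Objects are always linked in. A library is linked in only if
    it defines a symbol that is undefined at the moment it is reached.
    """
    defined, undefined = set(), set()
    for kind, defines, needs in inputs:
        if kind == "library" and not (defines & undefined):
            # Nothing seen so far needs this library, so it is skipped
            # and never revisited.
            continue
        defined |= defines
        undefined |= needs
        undefined -= defined
    return undefined


cuda_syms = {"cudaMallocHost", "cudaFreeHost", "cudaGetErrorString"}
libcudart = ("library", cuda_syms, set())
librt = ("library", {"shm_open"}, set())
common_o = ("object", {"makePinnedData", "freePinnedData"}, cuda_syms)
shared_mem_o = ("object", {"openSharedFile", "openExstSharedFile"}, {"shm_open"})

# Libraries first, as in the failing command.
failing_order = [librt, libcudart, common_o, shared_mem_o]
# Objects first, libraries to their right, as in the rearranged command.
fixed_order = [common_o, shared_mem_o, libcudart, librt]

print(sorted(unresolved_symbols(failing_order)))
print(sorted(unresolved_symbols(fixed_order)))
```

In this model the failing order leaves exactly the four symbols from the error output unresolved, while the rearranged order leaves none.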

Ziqi · November 10, 2021, 6:00pm

Thank god! That was really the cause! Now I understand why software engineering is a matter of experience. A good lesson for someone from academia… I am just wondering how one would ever think in this direction without much background, in what situations one needs to know about link order, and how to systematically build knowledge like this.