Strange link error seen by multiple people while building PyTorch cpp/cuda extensions

I get the same mysterious link error while building two different PyTorch cpp/cuda extensions on Windows. One is Facebook's “SparseConvNet” and the other is “Pointgroup_ops”, from a different research group.
For the first library, the error is:

sparseconvnet_cuda.obj : error LNK2001: unresolved external symbol "public: long * __cdecl at::Tensor::data_ptr<long>(void)const " (??$data_ptr@J@Tensor@at@@QEBAPEAJXZ)
build\lib.win-amd64-3.7\sparseconvnet\SCN.cp37-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals

For the second library, the error is:

pointgroup_ops.obj : error LNK2001: unresolved external symbol "public: long * __cdecl at::Tensor::data_ptr<long>(void)const " (??$data_ptr@J@Tensor@at@@QEBAPEAJXZ)
build\lib.win-amd64-3.7\PG_OP.cp37-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals

The two errors look nearly identical. Any insights into the cause or a fix? The code is the same code that builds on Linux, so it can’t be that something is genuinely undefined. ‘QEBAPEAJXZ’ is not a string that occurs in the code.

I searched online for the strange ‘QEBAPEAJXZ’ string, and it seems people have run into it while building other libraries as well.
Here are two examples: How to execute build_and_install.sh on WIN10? · Issue #75 · sshaoshuai/PointRCNN · GitHub
Sorry, this content has been deleted by the author - Zhihu (知乎)

But I did not find a solution that worked. Any help resolving this would be appreciated, by me and probably by others who have run into it as well.
Thanks.

I smell big trouble. Programmers should not use long anywhere in code that is intended to be portable across platforms. On 64-bit Linux platforms, long is a 64-bit type (the same size as long long), while on 64-bit Windows platforms, long is a 32-bit type (the same size as int). In C++, long nevertheless remains a distinct type from both int and long long on every platform, which matters for templates and name decoration. CUDA maintains type size compatibility between host and device code, so this difference also applies to device code on those two platforms.
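To make the size difference concrete, here is a minimal sketch in standard C++. The helper name long_bits is made up for illustration; the widths mentioned in the comments are what the usual 64-bit Linux and 64-bit Windows data models give, not something the language guarantees.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Width of `long` in bits on the platform this is compiled for.
// Typically 64 on 64-bit Linux (LP64) and 32 on 64-bit Windows (LLP64).
// (long_bits is a hypothetical helper, not part of any library.)
constexpr std::size_t long_bits() { return sizeof(long) * CHAR_BIT; }

// The fixed-width types have the same width on every platform that provides them.
static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "int32_t is 32 bits");
static_assert(sizeof(std::int64_t) * CHAR_BIT == 64, "int64_t is 64 bits");

// The standard only promises minimum widths for long and long long.
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");
static_assert(sizeof(long long) * CHAR_BIT >= 64, "long long has at least 64 bits");
```

Printing long_bits() from a small main on each platform shows the difference directly. The static_asserts compile on both platforms because they only rely on what the fixed-width types promise.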

QEBAPEAJXZ is presumably just C++ name decoration resulting from data_ptr<long>(void)const. As C++ name mangling is toolchain specific, I don’t think you would want to chase that down.

I suspect the real reason this doesn’t resolve during linking is that the long here is matched up with either int or long long elsewhere, which makes the code link fine on one platform but not another. Can you make the code base “long clean”, i.e. replace all occurrences of long with more appropriate types? Guessing at the intentions, when the programmers used long they likely meant “signed integer type wider than int”, so as an initial attempt one could try replacing all instances of long with long long. You would still need to review whether those changes are in fact appropriate.
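Here is a sketch of how such a mismatch can turn into an unresolved external. It assumes the library declares a member template for every T but only provides definitions for a fixed list of types that includes int64_t; I have not checked the PyTorch sources for this. MockTensor and first_element are made-up names, not the real at::Tensor API.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a library class; this is NOT the real at::Tensor.
struct MockTensor {
    void* storage = nullptr;

    // Declared for every T, as a library header might do ...
    template <typename T>
    T* data_ptr() const;
};

// ... but defined only for a fixed set of types, as a library binary might do.
template <>
inline std::int32_t* MockTensor::data_ptr<std::int32_t>() const {
    return static_cast<std::int32_t*>(storage);
}

template <>
inline std::int64_t* MockTensor::data_ptr<std::int64_t>() const {
    return static_cast<std::int64_t*>(storage);
}

// True where int64_t is a typedef for long (typically 64-bit Linux),
// false where it is a typedef for long long (typically 64-bit Windows).
constexpr bool long_is_int64 = std::is_same<long, std::int64_t>::value;

// Portable: data_ptr<std::int64_t> has a definition on every platform.
inline std::int64_t first_element(const MockTensor& t) {
    return *t.data_ptr<std::int64_t>();
}

// By contrast, t.data_ptr<long>() compiles everywhere (the declaration exists),
// but it only finds a definition at link time where long_is_int64 is true.
```

Under that assumption, data_ptr<long> is the very same function as data_ptr<int64_t> on Linux, while on Windows it names a separate instantiation that nobody defined, which is what the linker reports.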

If the object being referred to is an actual pointer, a portable way of expressing that as an integer is to use uintptr_t, an unsigned integer type from <stdint.h> / <cstdint> that can hold any object pointer (converted via void *) and convert it back unchanged, e.g. for the purpose of performing bit manipulations on pointers. The language standards technically make this type optional, but all mainstream toolchains provide it.
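As a small illustration of the uintptr_t approach, the sketch below converts a pointer to an integer in order to check its alignment. The helper names to_bits and is_aligned are made up for this example.

```cpp
#include <cstdint>

// uintptr_t must be wide enough to hold an object pointer.
static_assert(sizeof(std::uintptr_t) >= sizeof(void*),
              "uintptr_t can hold an object pointer");

// Convert an object pointer to its integer representation.
// (to_bits is a hypothetical helper name.)
inline std::uintptr_t to_bits(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Check whether p is aligned to `alignment` bytes.
// Assumes `alignment` is a power of two.
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    return (to_bits(p) & (alignment - 1)) == 0;
}
```

Using long for the integer here would silently truncate 64-bit pointers on 64-bit Windows, which is exactly the kind of bug a "long clean" code base avoids.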


Thank you! That was a very insightful diagnosis!
I replaced all occurrences of “long” in both libraries with “int64_t”, and the strange link error disappeared; both libraries now build fine. You made my day!
