Strange CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND error from cudaMallocManaged

Hi Folks,

I’m a CUDA newbie porting some existing math-heavy code to CUDA, and I’ve hit a problem I can’t get past. I’ve ported a fair bit of code just fine, but cudaMallocManaged is now failing with CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND. The strange thing is that whether this error appears depends entirely on what the device procedure contains.

If the device code contains a call to either f1() or f2() then everything is fine, but as soon as it contains a call to both I hit the CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND error.

I’m assuming that cudaMallocManaged is a bystander here: it’s the first CUDA runtime call made, so I assume this is the point at which the device meta-code gets translated for the actual hardware? Could I be hitting some kind of size limit here?

Many thanks!

I think you’re on the right track with your thought process.

A symbol not found error can occur when you have device code that incorporates a failure detectable at load time. Such failures might be a device binary that does not match the GPU you are trying to run on, for example.

One way these errors can creep in is if you are specifying a compilation that involves compile to PTX only (or compile to PTX+SASS, but without specifying the correct SASS architecture for your GPU). Either of these approaches can involve a JIT-compile at runtime/load-time. This JIT-compile can fail (e.g. by hitting a machine limit), and as a result you have no binary for your GPU, so things won’t work. One of the side-effects is that device symbols are not loaded/visible. If your first evidence of this is a “bystander” operation that touches a device symbol, you’ll get a weird device-symbol-not-found error. A full walk-through of such a case is here:

http://stackoverflow.com/questions/22364926/cuda-invalid-device-symbol-error/22366668#22366668
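
To separate the “bystander” call from the real failure, it can help to check every runtime call and to make a cheap call before the allocation. The following is only a minimal sketch: `report` and `touch` are hypothetical names, and whether the module load (and any JIT-compile) actually happens during the first call depends on the CUDA version and its module loading mode, so treat this as a diagnostic aid rather than a guaranteed reproducer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on success, otherwise prints which
// step failed together with the error name and description.
static bool report(cudaError_t err, const char* step) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s (%s)\n", step,
                 cudaGetErrorName(err), cudaGetErrorString(err));
    return false;
}

// Trivial kernel so that the executable contains some device code.
__global__ void touch(double* out) { out[0] = 1.0; }

int main() {
    // cudaFree(0) is a commonly used no-op call that makes the runtime
    // initialize before any real work is requested.
    if (!report(cudaFree(0), "runtime initialization")) return 1;

    double* data = nullptr;
    if (!report(cudaMallocManaged(&data, sizeof(double)), "cudaMallocManaged")) return 1;

    touch<<<1, 1>>>(data);
    if (!report(cudaGetLastError(), "kernel launch")) return 1;
    if (!report(cudaDeviceSynchronize(), "kernel execution")) return 1;

    std::printf("out[0] = %f\n", data[0]);
    return report(cudaFree(data), "cudaFree") ? 0 : 1;
}
```

If the very first call already reports the error, the allocation itself is probably not the culprit.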

So I guess the first question I would have is: what is your exact compile command line, and what GPU are you actually running the code on when you witness the CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND error?

If your case involves a JIT-compile, and if the JIT-compile is actually failing in your case, the solution to make the problem more “visible” is to force the PTX-to-SASS compile step to occur at compile time. You can do this by accurately specifying a device architecture to compile for which matches your GPU.

For example if you were compiling for a cc3.5 architecture, but running on a cc5.0 device, you might be specifying -arch=sm_35. The “fix” would be to specify -arch=sm_50 when compiling. If you do that with the code configuration that calls both f1() and f2(), you may witness a compile-time error which will be instructive.
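
If you are not sure which architecture your GPU actually is, you can ask the runtime. This is a small sketch using cudaGetDeviceProperties; the `arch_flag` helper is a hypothetical name introduced here just to format the matching option.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: builds the nvcc option matching a compute capability,
// e.g. (5, 0) gives "-arch=sm_50".
static std::string arch_flag(int major, int minor) {
    return "-arch=sm_" + std::to_string(major) + std::to_string(minor);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // Compare this against the -arch value on your nvcc command line.
        std::printf("device %d: %s, compute capability %d.%d, compile with %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    arch_flag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```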

Specifying -arch made no difference: I see no new compiler messages but the same runtime error. However, simply defining NDEBUG=1 so that asserts became no-ops fixed the issue for now.

Are you using assert() calls in device code, by any chance? Best I know, the standard function assert() is not supported in device code, so it makes sense that its use would trigger a JIT compilation error.

If the assert() instances in question are in host code, on the other hand, use of NDEBUG=1 will cause the assertions to be disabled, which means the program is now potentially ignoring real errors.

Are you on MacOS?

assert is supported in device code but not on MacOS:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assertion

“It is not supported on MacOS, regardless of the device, and loading a module that references the assert function on Mac OS will fail.”

So if you are on MacOS and you are using assert in device code, that could be an explanation for the module load failure leading to the symbol reference failure. (However, I can’t connect that with the f1()/f2() data point.)

@txbob: Thanks for the correction, I apparently missed the fact that support for device-side assert() had been added to CUDA, with the exception of the OS X platform.

Not on the Mac, no. I’m on Windows with the vc12 host compiler.

The code in question is existing generic code (Boost.Math library) which I’m investigating porting so it can be used on the device as well as the host. The asserts are macro-ised, so longer term I can search and replace them with something that always evaluates to a no-op on the device only. Strangely it’s only when the assert is mixed with “something else” (and I’m not sure what that is yet) that the problem manifests.
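
For anyone wanting to try the same thing, here is a rough sketch of an assert macro that stays active on the host but compiles to nothing in device code. It relies on `__CUDA_ARCH__`, which nvcc defines only during the device compilation pass. `PORT_ASSERT`, `checked_half` and `half_kernel` are made-up names for illustration; this is not the macro Boost.Math actually uses.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical macro: a normal assert() in host code, a no-op in device code.
#ifdef __CUDA_ARCH__
#define PORT_ASSERT(expr) ((void)0)
#else
#define PORT_ASSERT(expr) assert(expr)
#endif

// Generic function usable on both host and device.
__host__ __device__ inline double checked_half(double x) {
    PORT_ASSERT(x >= 0.0);  // checked on the host only
    return 0.5 * x;
}

__global__ void half_kernel(double* v) { v[0] = checked_half(v[0]); }

int main() {
    // Host path: the assertion is active here unless NDEBUG is defined.
    std::printf("host:   %f\n", checked_half(4.0));

    // Device path: the assertion is compiled out.
    double* v = nullptr;
    if (cudaMallocManaged(&v, sizeof(double)) != cudaSuccess) return 1;
    v[0] = 4.0;
    half_kernel<<<1, 1>>>(v);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;
    std::printf("device: %f\n", v[0]);
    cudaFree(v);
    return 0;
}
```

Unlike defining NDEBUG=1 globally, this keeps the host-side checks in place.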