nvlink error : Undefined reference

Recently, I set out to write a kernel in which each image pixel performs several small matrix operations (matrix-vector multiplication, matrix-matrix multiplication, etc.).

Instead of writing my own macros or inline functions for small matrices (like a 3×3 matrix), I found that the cuBLAS device API library can actually do the job for me, so I decided to give it a try.

But when I added the required cublas_v2.h and all the .lib files (cublas.lib, cublas_device.lib, cudadevrt.lib, cudart_static.lib), just like in the CUDA sample simpleDevLibCUBLAS, I got:

ptxas fatal: Unresolved extern function ‘cublasCreate_v2’.
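For context, my device code follows the simpleDevLibCUBLAS pattern; a trimmed sketch looks roughly like this (the names and matrix sizes are just placeholders, not my real code):

#include <cublas_v2.h>

// Illustrative only: each thread calls the device-side cuBLAS API,
// which is what pulls in cublasCreate_v2 at device-link time.
__global__ void perPixelKernel(const float *A, const float *B, float *C, int n)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)   // resolves to cublasCreate_v2
        return;

    const float alpha = 1.0f, beta = 0.0f;
    // C = A * B for this thread's small n x n system
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
}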

After a little googling, I found that I had not set the Generate Relocatable Device Code option to Yes (-rdc=true), which is required to enable dynamic parallelism. I thought that would be it, but after I set -rdc=true, I got 940 errors instead, all like this:

CUDALINK : nvlink error : Undefined reference to ‘maxwell_hgemmBatched_256x128_raggedMn_nn’ in ‘C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/lib/x64/cublas_device.lib:maxwell_sm50_hgemm_batched.obj’ (target: sm_61)

Even if I comment out the entire kernel function that calls the cuBLAS device API, the nvlink errors are still there.

I am quite confused: my GPU is a GTX 1050 Ti, which is the Pascal architecture (sm_61), so what does it have to do with Maxwell sm_50?

Can somebody help me solve this problem? Thanks a lot.

Try replacing the code in the simpleDevLibCUBLAS project with yours and see if it compiles cleanly. If it does, then the problem is in your exact project setup.

txbob, thank you very much for your advice. I think I found where the problem is: if I add ‘compute_61,sm_61’ to the Code Generation settings of the simpleDevLibCUBLAS project, it also reports tons of nvlink errors, and if I delete ‘compute_61,sm_61’ from my own project, my project builds successfully. But why?

You should show us how you’re attempting to build your software here.

You definitely have issues with linking and we need to make sure that you’re referencing the proper files.

Btw, be careful. The cuBLAS device API is really just dynamic parallelism, which may not actually give you the performance you’re likely seeking.

Writing matrix math for small matrices is actually relatively straightforward, so don’t be afraid to do it yourself should the need arise.
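For example, a hand-rolled 3×3 routine is only a few lines; here’s a rough sketch assuming row-major storage (the function names are made up):

// y = A * x for a row-major 3x3 matrix
__device__ void matvec3(const float A[9], const float x[3], float y[3])
{
    for (int i = 0; i < 3; ++i)
        y[i] = A[3*i + 0] * x[0] + A[3*i + 1] * x[1] + A[3*i + 2] * x[2];
}

// C = A * B for row-major 3x3 matrices
__device__ void matmul3(const float A[9], const float B[9], float C[9])
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            C[3*i + j] = A[3*i + 0] * B[0*3 + j]
                       + A[3*i + 1] * B[1*3 + j]
                       + A[3*i + 2] * B[2*3 + j];
}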

MutantJohn, thank you very much for your reply. The thing is, the CUDA sample simpleDevLibCUBLAS project also reports these nvlink errors with only ‘compute_61,sm_61’ added and nothing else changed, so I think anybody with a GTX 1050 Ti and CUDA 8.0 installed can probably reproduce this error.

Keep in mind that without specifying the architecture, I think CUDA falls back to a JIT model. Once you compiled the code without specifying the architecture, did you actually try to run it? Runtime compilation of the device code is likely to fail.
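An easy way to catch that kind of failure is to check the error status around the launch; here’s a minimal, self-contained sketch (the dummy kernel is just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main()
{
    dummyKernel<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();       // reports launch-time (including JIT) failures
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    err = cudaDeviceSynchronize();              // reports failures during kernel execution
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 0;
}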

To parrot other users on this forum, it’d help us the most if you could provide a minimal example that exhibits the behavior as well as your compilation commands.

Just write a simple dummy.cu which attempts to call the desired functions and show us how you’re linking to the target libs.
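Something along these lines is what I mean; the path and flags are only illustrative (Linux-style command shown, the Visual Studio project settings map onto the same switches):

nvcc -arch=sm_60 -rdc=true dummy.cu -o dummy -lcublas_device -lcudadevrt -lcublas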

I can run the original CUDA sample simpleDevLibCUBLAS project (provided with the CUDA 8.0 toolkit) successfully on my GTX 1050 Ti. It specifies the architectures as ‘compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60’, but once I add ‘compute_61,sm_61’ to the end, the project reports nvlink errors.

I think this is expected behavior. The cuBLAS device library only supports a limited set of device-linkable architectures; I believe they were limited to avoid code bloat in the library, and because there were no useful differences for the architectures not listed in the sample project.

So if you want to use device-side cuBLAS, limit yourself to the architectures suggested in the sample project. If you include the sm_60 option, the code should run correctly on your cc 6.1 device.
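In other words, keep the code generation list to the architectures the sample ships with, for example (an illustrative command line; the same entries go into the Code Generation field in Visual Studio):

nvcc -rdc=true kernel.cu -o app -lcublas_device -lcudadevrt -lcublas \
    -gencode arch=compute_35,code=sm_35 \
    -gencode arch=compute_50,code=sm_50 \
    -gencode arch=compute_60,code=sm_60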

OK, got it, thank you txbob.

But will performance drop noticeably if sm_60 instead of sm_61 is specified for a cc 6.1 device?