How to debug "Invalid memory reference" during device linking

I have code that mixes C++ and CUDA, with thousands of CUDA functions in thousands of CUDA source files, so separate compilation with device linking is required. Each individual compilation succeeds, but the device link step fails with this error:

/usr/local/cuda-5.5/bin/nvcc -arch=sm_35 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda-5.5/include -I/usr/local/cuda-5.5/samples/common/inc  -dlink *.o -o ../linker.o
nvcc error   : 'nvlink' died due to signal 11 (Invalid memory reference)
nvcc error   : 'nvlink' core dumped

I was wondering what can cause this error. Is it a bug in nvcc, or is it my code? How can I fix it? And even if it is an nvcc bug, what can I do to avoid triggering it?

BTW, I have tried nvcc versions 5.0, 5.5, and 6.0 on both Linux and Mac OS X. None of them works.

It may be a bug in nvcc.

The best option would be for you to provide a sample test case that reproduces the issue.

If not, I would also try the CUDA 6.5 RC.

The code has been uploaded to

[url]https://github.com/kevinphy/git_V1_Proc0[/url]

It is pretty large, and compilation will take a long time.

It is difficult to try the CUDA 6.5 RC, since the server admin does not want an RC version installed.

After compiling about 275 of the ~4000 modules in SubProcesses, I got the following compile error:

...
nvcc -arch=sm_30 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc  -dc V1Amp4Hel11Sec11FiniteIntegral.cu
nvcc -arch=sm_30 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc  -dc V1Amp7Hel10Sec5FiniteIntegral.cu
nvcc -arch=sm_30 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc  -dc V1Amp31Hel6epsM2Integral.cu
nvcc -arch=sm_30 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc  -dc V1Amp30Hel9epsM2Coeff.cu
V1Amp30Hel9epsM2Coeff.cu(29): error: identifier "vars15" is undefined

V1Amp30Hel9epsM2Coeff.cu(55): error: identifier "vars15" is undefined

V1Amp30Hel9epsM2Coeff.cu(74): error: identifier "vars15" is undefined

V1Amp30Hel9epsM2Coeff.cu(95): error: identifier "vars15" is undefined

V1Amp30Hel9epsM2Coeff.cu(116): error: identifier "vars15" is undefined

V1Amp30Hel9epsM2Coeff.cu(137): error: identifier "vars15" is undefined

6 errors detected in the compilation of "/tmp/tmpxft_00007508_00000000-6_V1Amp30Hel9epsM2Coeff.cpp1.ii".
make[1]: *** [V1Amp30Hel9epsM2Coeff.o] Error 2
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory `/home/bob/misc/j8/git_V1_Proc0-master/SubProcesses'
make: *** [linker.o] Error 2
$

This seems to be a legitimate problem with the code, not a tool failure. Many of the *Coeff.cu files use this identifier, but I can’t find it defined anywhere. It’s not referenced in any of the header files.

By the way, it's not difficult to install a CUDA toolkit in your local userspace and compile with it. You don't need any assistance from a sysadmin for that.

Sorry, that was a slightly outdated version. The git repository has now been updated; please try again.

Version 6.5 also failed, but with a different error message:

/home/hbsally/local/cuda-6.5/bin/nvcc -arch=sm_35 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/home/hbsally/local/cuda-6.5/include -I/home/hbsally/local/cuda-6.5/samples/common/inc  -dlink *.o -o ../linker.o
nvlink fatal   : Internal error: section not mapped

I have reproduced the (CUDA 6.5RC) linker error (after ~24 hours of compiling ~4000 modules) and have filed a bug with NVIDIA. If I learn anything relevant, I will update this thread.

This appears to be a limitation of the device linker in nvcc at this time. It is triggered basically by linking a large number of functions together. I don't expect any changes in the CUDA 6.5 production release, as that is imminent; however, there may be some improvements after that. I don't have any great suggestions for workarounds at this time, but here are some possibilities:

  1. If you are linking in unnecessary functions/files, don't.
  2. If you can separate your function call graph into groups of functions that call each other but don't call outside of the group, you may be able to work around this by sub-linking the files that contain those groups together, i.e. link each group into a separate combined object that contains only functions/files from that group. Once you have the sub-groups linked into separate objects, you should be able to link those objects together without running into this issue (assuming the groups are truly separate and don't call into each other).

The NVIDIA compiler team is aware of the issue. Thanks for bringing it to their attention.

Thank you so much for your help.

So I can compile the functions in groups, and then combine the groups. But how do I do this in practice? Specifically, how do I perform the sub-linking?

For example, is it correct to compile functions A1,…,A100 and device-link them into linkerA, then compile B1,…,B100 and device-link them into linkerB, and finally combine linkerA and linkerB before the final build?

Yes, that is the idea. However, it's not as simple as just breaking the files into groups. The groups must have call graphs that don't intersect: functions in group A cannot call any functions in group B, and vice versa.
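For example (following your A/B naming; the exact file names are hypothetical), the build could look roughly like this:

nvcc -arch=sm_35 -m64 -O3 -dc A1.cu … A100.cu B1.cu … B100.cu
nvcc -arch=sm_35 -m64 -dlink A1.o … A100.o -o linkerA.o
nvcc -arch=sm_35 -m64 -dlink B1.o … B100.o -o linkerB.o
g++ A*.o B*.o linkerA.o linkerB.o -L/usr/local/cuda/lib64 -lcudart -o app

The two -dlink steps resolve the device symbols within each group separately; the final host link then pulls in the original objects plus both device-linked objects.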

Now I am trying to restructure the code to break it into groups. But it seems constant memory cannot be used with separate compilation across multiple device-link units. If the non-"extern" declaration of the constant memory is in one link unit, the other link units complain it is "undefined"; and if the non-"extern" declaration is in every link unit, the compilation complains it is "redefined". I then changed all the constant memory to global memory, passed via a struct, but that makes the code run about 20 times slower. This is definitely not acceptable.

Do I have another choice? Or is there another way to use constant memory with separate compilation and multiple device-link units?

Thanks.

Try moving the constant data to global memory, but explicitly use the read-only data cache to access it:
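A minimal sketch of that pattern (the array, its size, and the kernel name are hypothetical):

#include <cuda_runtime.h>

// was: __constant__ double d_coeffs[256];  -- now ordinary global memory
__device__ double d_coeffs[256];

__global__ void useCoeffs(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&d_coeffs[i % 256]);  // load via the read-only data cache
}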

This is only available on cc 3.5 devices, but the latest makefile in your project appears to be compiling for cc 3.5.

I found a bug while changing the code; the global memory version now runs at almost the same speed as the constant memory version. However, a new problem arises: the sm_35 build makes the kernel fail to launch with "too many resources requested for launch", while the sm_30 build works. Therefore, unfortunately, I cannot try the __ldg function. Did I hit some CUDA internal bug again?

Yes, the 20x slowdown for constant vs. non-constant sounded a little extreme to me. "too many resources requested for launch" is most likely a registers-per-thread issue, not an internal bug. Use -Xptxas=-v during compilation to print the registers used per thread for each kernel, multiply by the number of threads being launched, and compare with the maximum for a cc3.5 device. Then use launch_bounds on the offending kernels to reduce their register usage to fit below the limit. If you google "too many resources requested for launch" you'll find plenty of guidance.
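For example (illustrative numbers): a kernel that uses 162 registers per thread launched with 512-thread blocks needs 162 × 512 = 82,944 registers per block, which exceeds the 65,536-register file of a cc3.5 multiprocessor, so the launch fails. This may also explain the cc3.0 vs. cc3.5 difference: cc3.0 limits a thread to 63 registers, while cc3.5 allows up to 255, so the same code can compile to a much higher per-thread register count on cc3.5.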

The weird thing is that it is the same GPU: the cc3.0 build works, but the cc3.5 build complains "too many resources requested for launch".
I then tried to check the number of registers used. However, the nvcc compiler does not print the register information, only other confusing info. For example, for the cc3.0 compilation

/usr/local/cuda-5.5/bin/nvcc -gencode arch=compute_30,code=sm_30 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda-5.5/include -I/usr/local/cuda-5.5/samples/common/inc  -Xptxas=-v -dc V1Amp4Hel4Sec11FiniteIntegral.cu

ptxas info    : Function properties for _Z30V1Amp4Hel4Sec11FiniteIntegral4PdS_P8PSGLOBALP11MODELGLOBAL
    440 bytes stack frame, 752 bytes spill stores, 696 bytes spill loads

and for cc3.5 compilation

/usr/local/cuda-5.5/bin/nvcc -gencode arch=compute_35,code=sm_35 -m64 -O3 -Xcudafe "--diag_suppress=declared_but_not_referenced" -lstdc++  -I/usr/local/cuda-5.5/include -I/usr/local/cuda-5.5/samples/common/inc  -Xptxas=-v -dc V1Amp4Hel4Sec11FiniteIntegral.cu

ptxas info    : Function properties for _Z30V1Amp4Hel4Sec11FiniteIntegral4PdS_P8PSGLOBALP11MODELGLOBAL
    344 bytes stack frame, 332 bytes spill stores, 332 bytes spill loads

Anyway, I later tried launch bounds with a block size of 512 and 116 blocks:

__launch_bounds__(512,116)

which does not work; it still complains "too many resources requested for launch".
Then, when I used

__launch_bounds__(512)

I got the following link error:

nvlink error   : entry function '_Z22call_complex_integrandILm6822779959680144713EEviP7ComplexPdS2_S2_P8PSGLOBALP11MODELGLOBAL' with max regcount of 128 calls function '_Z30V1Amp4Hel4Sec8FiniteIntegral18PdS_P8PSGLOBALP11MODELGLOBAL' with regcount of 162

It seems the register count is indeed exceeded, but launch_bounds does not help.

BTW, I have uploaded a sample code in
[url]https://github.com/kevinphy/git2_V1_Proc0[/url]

Compiling for cc3.0 vs. cc3.5 can certainly result in a different number of registers per thread.

Your file V1Amp4Hel4Sec11FiniteIntegral.cu contains no global functions (no kernel definitions). The compiler does not emit registers-per-thread usage for a device function, because that depends on the kernel from which the device function is called, and may also be affected by inlining, for example.
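When a file does contain a __global__ function, -Xptxas=-v reports a per-kernel register line of roughly this form (kernel name and numbers illustrative):

ptxas info    : Compiling entry function '_Z8myKernelPdi' for 'sm_35'
ptxas info    : Function properties for _Z8myKernelPdi
ptxas info    : Used 40 registers, 336 bytes cmem[0]

The "Used NN registers" line is the per-thread count to multiply by your block size.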

Furthermore, reading the documentation ([url]Programming Guide :: CUDA Toolkit Documentation[/url]), we see that launch_bounds is used to decorate global functions (only), not device functions.

So instead of decorating device functions in this file, you’ll need to identify the file that actually contains the kernel that is failing due to “too many resources requested for launch” and decorate that kernel appropriately.

As an aside, (again, read the documentation) your choice of 116 for the minBlocksPerMultiprocessor parameter probably doesn’t make sense. These devices can support a maximum of 16 blocks per multiprocessor, so specifying 116 is not sensible. This parameter is not used to specify the total number of blocks launched, but instead is a hint to the compiler about desired occupancy per SM.
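For reference, a minimal sketch of a sensible decoration (kernel name hypothetical); the first argument is the maximum threads per block, the second the desired minimum resident blocks per multiprocessor:

__global__ void
__launch_bounds__(512, 2)   // 512 threads per block max; aim for >= 2 resident blocks per SM
myKernel(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0 * in[i];
}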

Thank you so much for all your help.
I think we have figured out all the problems. launch_bounds was indeed applied to the global function, but the key is that the device functions used by the global function have to be __forceinline__. Otherwise, a device function is compiled without the restriction from launch_bounds, and when the linker combines the global function and the device function, the problem arises.
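For example, the pattern that now works for us looks roughly like this (names are hypothetical):

// must be force-inlined so it is compiled under the kernel's launch_bounds budget
__forceinline__ __device__ double helper(double x)
{
    return x * x;
}

__global__ void
__launch_bounds__(512)
integrate(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = helper(in[i]);
}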
Thanks to your kind help, we have a much better understanding of CUDA programming. We really appreciate it.