Linking device code

Hi,
I am calling from my kernels in file A.cu some device functions implemented in file mylib.cu, which is included in file A.cu. Here is my problem, I have to call the functions in file mylib.cu from some other kernels in file B.cu too. I tried to include “mylib.cu” in “B.cu” and obviously the linker complains about duplicate symbols.
Right now, I’ve solved this problem copying the file mylib.cu in mylib2.cu and including this in B.cu (I also added two different namespaces in mylib.cu and mylib2.cu) and everything works fine.
Is there a solution to this problem or a more elegant way to do this?
Thank you for your time.

PS. Merging file A.cu and file B.cu is not an option since they do totally different things.
PPS. I am using CUDA 6.5.

put the functions in c.cu, (forward) declare the functions in common_f.cuh, include common_f.cuh in both a.cu and b.cu…?

Yes. You can partition your project similarly to how you would partition an ordinary C or C++ project into separate modules or translation units. The wrinkle with CUDA is that when we are calling or referencing device code or symbols across separate translation units, we need to use so-called separate compilation and linking, also called device-compile and device-link, to link the device code pieces together.

Here is a fully worked example:

A.cu:

#include "mylib.h"
#include <stdio.h>

__global__ void kernel_B();

__global__ void kernel_A(){

  int my_val=mylib_func(4);
  printf("A: %d\n", my_val);
}

int main(){

  kernel_A<<<1,1>>>();
  kernel_B<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

B.cu:

#include "mylib.h"
#include <stdio.h>
__global__ void kernel_B(){

  int data = mylib_func(6);
  printf("B: %d\n", data);
}

mylib.cu

#include "mylib.h"

__device__ int mylib_func(int data){
  return data/2;
}

mylib.h:

__device__ int mylib_func(int);

compile:

$ nvcc -dc A.cu
$ nvcc -dc B.cu
$ nvcc -dc mylib.cu

link:

$ nvcc -rdc=true A.o B.o mylib.o -o test

The -dc indicates “device compile”. -rdc indicates “create relocatable device code”

You can read more about separate compilation and linking in the nvcc manual:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda

You may also be interested in the following thread on how to create libraries of device functions, where I gave a minimal but complete example:

https://devtalk.nvidia.com/default/topic/526645/how-to-create-a-static-lib-for-device-functions-using-cuda-5-0-/

Thank you so much guys!
I didn’t know the existence of those flags or the possibility to use ar with device code in object files.

HI,
I did as you said, but I am having some problems:

  1. I am compiling with --ptxas-options=-v, but if I compile with -dc I cannot understand how many registers is using my program, all I get is something like this:
    ptxas info : Function properties for _ZN14statistics_gpu17GetGamma_modifiedEP17curandStateXORWOWS1_dd
    72 bytes stack frame, 68 bytes spill stores, 68 bytes spill loads
    ptxas info : Function properties for _ZN14statistics_gpu10get_normalEP17curandStateXORWOW
    32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
    ptxas info : Function properties for _ZN14statistics_gpu11get_uniformEP17curandStateXORWOW
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
    Is there a way to get the same output of --ptxas-options=-v compiled without -dc?

  2. A set of device functions compiled this way is not working (taken from http://people.sc.fsu.edu/~jburkardt/cpp_src/asa243/asa243.cpp , re-written in order to be device functions and work in CUDA). Let me explain: if I compile these functions in the same compilation unit (i.e. including “asa243.cu” in my program) everything works, but if I include “asa243.cuh” in the main file, compile “asa243.cu” with -dc and link everything with -rdc=true, all I get from these functions is nan.

Any help or even thoughts about this will be appreciated.

PS. In the file asa243.cpp I’m using alnorm, betain and tnc, not student_noncentral_cdf_values.

It seems to me that without information from whole-program compilation it is impossible to know the total register usage for a kernel until the code has been linked. txbob might be able to supply authoritative background information on this if this is not covered in the documentation.

Without a complete, compilable and runnable code sample that reproduces the problem and knowledge of the exact compiler invocations used, it seems impossible to diagnose (2). Make sure that a proper prototype is visible to all code that references a function from a different compilation unit. This is the same as in host compilation, where, for example, a missing #include <math.h> can cause bad results if math library functions are called.

Note that CUDA already offers the CDF of the standard normal distribution as the math library function normcdf{f}(), so there is no need to port third-party code for that.

Dear njuffa,

now I got the reason why the compiler cannot tell me how many registers each kernel is going to use: if a kernel calls some device functions implemented in another file, then the compiler will not know what is that kernel going to do nor how many registers it will use. But this leads me to another question: when is the number of registers settled? During the linking phase or just-in-time during the execution?

About the second point, I need that third-party code for the cdf of a t Student, because I did not find it in the CUDA SDK. I re-created the problem in this repo https://github.com/giogio12345/CUDA_Test: in the “working” directory I compile everything in the main file and I get 0.5, i.e. the t Student cdf of 0. In the “notworking” directory, where the program is compiled with -dc and -rdc=true, the result is 0, instead of 0.5, which is wrong (in my program I got nan because I apply a logarithm to the values returned by tnc).
From cuda-memcheck I get this output:

Program hit cudaErrorLaunchOutOfResources (error 7) due to “too many resources requested for launch” on CUDA API call to cudaLaunch.

I am using CUDA 6.5 on a computer with Tesla K20 (CC 3.5).

Thanks for your time.

PS. I know that in the file “tStudent.cu” there is not the best implementation of those functions…I should use constant memory. I will do that, eventually.

First of all you should be doing proper cuda error checking, so you don’t have to rely on cuda-memcheck to tell you when something is going wrong.

the visual profiler can give an indication of the actual number of registers per thread. nvprof can also be used to identify registers per thread at runtime. Use the --print-gpu-trace option, and the registers per thread requested for each kernel launch will be displayed in the “Regs*” column.

“too many resources requested for launch” usually implies either too many registers per thread (times the total number of threads per block - this can’t be determined in the general case until run-time) or too much shared memory requested (also can’t be determined in the general dynamic allocation case until run-time).

To start debugging a “too many resources” issue, start by re-compiling your code with a -maxrregcount switch that is consistent with the max registers per SM for your device:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

divided by the largest kernel launch you have (in terms of threads per block requested). So, using 1024 threads per block on K20, try

nvcc -maxrregcount 48 …

and then see if the “too many resources” error goes away. If it does, then you have confirmed it is a registers per thread issue. You can then use the profiler to track down which kernel is the culprit (that is limited to 48 registers per thread in this example) and use launch bounds to manipulate that kernel directly, and dispense with the -maxrregcount compile switch:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds

Yes, it was definitely a register problem, in fact compiling with -maxrregcount 32 or -maxrregcount 64 makes everything work.

Still, I don’t understand why I don’t get the error if I compile everything in the same compilation unit.

Anyway, thank you very much.

In whole-program compilation, the compiler is able to optimize more aggressively since it “sees” all the code at once. In particular, it is able to inline many, or even all, device functions. This gives the compiler a lot of freedom during register allocation.

When using separately compiled device functions, device functions cannot be inlined until link time. At this time the CUDA device linker does not offer that capability; in general few linkers do. The ABI’s function call convention restricts the register allocation performed by the compiler. Assigning particular registers to particular tasks during function call and return is a feature shared by many ABIs, including x86 and ARM.

As a consequence, code built from separately compiled functions may use more registers and may run slower than the same code built from a single compilation unit.

As for normcdf(): It seemed to me from a cursory look at the source you pointed to that the function ALNORM in that file is computing the CDF of the standard normal distribution, that’s why I mentioned it. You are correct that CUDA does not offer functions to compute the CDF of other distributions.

The actual register usage can easily be different when everything is compiled in the same compilation unit. In that case, the compiler can optimize out (inline) function calls, and do local optimizations around each inlined function. This is not possible in the externally linked case, and can definitely affect (e.g. increase, in this case) register usage.

The basic idea is that the compiled code is actually different in the two cases, and therefore the register usage can be different.

Thanks guys.