Separate Compilation and Linking of CUDA C++ Device Code

Originally published at:

Managing complexity in large programs requires breaking them down into components that are responsible for small, well-defined portions of the overall program. Separate compilation is an integral part of the C and C++ programming languages which allows portions of a program to be compiled into separate objects and then linked together to form an executable or…

The "main.cpp" listing above is missing some lines.

You're right! Thanks for catching that. I added in the missing lines.

Hi Tony
I am trying to run CUDA code on Amazon AWS. I am using the K-means CUDA implementation by Serban Giuroiu, in which different functions are written in different files.

I just need to alter a few parameters and run the code with my data. The default object file runs fine, but when I recompile the .cu files after changing the parameters, I get errors. The nvcc -c command creates the object files, but the resulting objects do not run and produce errors. I can give more details about the errors if needed.

I have a file that calls a function cuda_kmeans(), which is defined in another file where I need to change a few values. Any suggestions on how I should compile them from the command line? I'd appreciate any help.

I assume you're talking about that repo. I checked it out, and as it stands in the repo, the makefile is not set up to use separate compilation units. Make sure to add the -dc option when building the objects. If you're linking with nvcc, that should cover it. My suspicion is that you are calling one of the __device__ routines defined in one file from a kernel in another. If this is not the problem, comment back and I'll help you dig deeper. Thanks!

-Tony Scudiero

Thanks, Tony, for your reply. After changing the code, if I 'make' it again, it seems to be working fine.
Thanks again.

How do you set this build up with Nsight Eclipse Edition? I can build with your makefile but not from Eclipse. I posted all the details on SO:

Great. A nice reference for novices in CUDA programming.

Just for completeness, in the section 'Advanced Usage: Using a Different Linker', the interested reader should perform the following commands:

# object files
nvcc -x cu -arch=sm_20 -I. -dc main.cpp -o main.o
nvcc -x cu -arch=sm_20 -I. -dc particle.cpp -o particle.o
nvcc -x cu -arch=sm_20 -I. -dc v3.cpp -o v3.o

# gpu object file
nvcc -arch=sm_20 -dlink v3.o particle.o main.o -o gpuCode.o

# final link
g++ -L/usr/local/cuda-7.5/lib64 gpuCode.o main.o particle.o v3.o -lcudart -o app

The last command needs the -L option to specify the directory containing the CUDA runtime library (the one requested by -lcudart); otherwise the linker may fail to find it.

I tried to build the code in Microsoft Visual Studio 2013 and got an unresolved extern function error for the advance function in particle. Why is that? Thank you.

That error indicates that you haven't told the CUDA compiler to generate relocatable device code - i.e. linkable device code. In your project property pages, go to CUDA C/C++ -> Common, look for the field "Generate Relocatable Device Code", and use the pull-down menu to set it to "Yes (-rdc=true)". You should then be able to build and run.

It wasn't evident to me at first, but after reading the article again, I wanted to note for the NVIDIA CUDA developer community that you can put EITHER a __host__ or a __device__ qualifier before a class method. In the code I was working on, I needed to use a class from a __global__ function, and it compiled successfully once the class method had a __device__ qualifier. I also found that if the class needs to create or hold arrays, objects, ints, etc. on the device, then the method that sets them up (for example, the constructor) needs to be qualified with __device__ as well. I also wrote this up to get the good word out:

Thanks Murphy and Scudiero for the article! And I find Harris' articles and GitHub repo to be super useful as well.

You can also combine `__host__` and `__device__` on a single method.

I get these warnings when I combine __host__ and __device__ and run my makefile:
physlib/ warning: a __device__ function("dev_Grid3d::dev_Grid3d") redeclared with __host__ __device__, hence treated as a __host__ __device__ function

physlib/ warning: a __device__ function("dev_Grid3d::flatten") redeclared with __host__ __device__, hence treated as a __host__ __device__ function

My program is in this subdirectory, where I implemented a finite volume upwind method for convection in 3 dimensions in CUDA C/C++:
and the C++ class that I wanted to run on the device is here:

As a matter of good programming practice, when the compiler emits WARNINGS, is it always best to make changes until they go away? They weren't errors, and my executable did what I wanted.

* Tangent side note 1: Dr. Harris, since I am interested in (passionate about) implementing combustion CFD on the GPU, I have been going through the "standard" ("canonical") computational methods for aerospace engineering (finite difference, finite volume, etc.) from the graduate school level up, and I found most solvers implemented in 2 dimensions. Why not in 3 dimensions, since we obviously live in a 3-dimensional world?

* Tangent side note 2: @Mark_Harris I tried increasing the grid size dimension from 64 to 92 and above in your blog post about 3-dimensional finite difference methods, and I get segmentation faults. Why doesn't it scale? I don't think it's a RAM (memory) problem. cf.

Yes, it's important to fix warnings. In this case you need to make the declaration (in .h) and definition (in .cu) of these methods match (__host__ __device__ on both).
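
To make the matching-qualifier point concrete, here is a minimal sketch rather than the code from the question. The class name Grid3d, the HD helper macro, and the bodies of the methods are assumptions for illustration only; the declaration and definition are shown in one listing, though they would normally live in the .h and .cu files respectively. The macro expands to nothing under a host-only compiler so the same source builds with or without nvcc.

```cpp
#include <cassert>

// Hypothetical helper macro (not part of CUDA): expands to
// __host__ __device__ under nvcc and to nothing otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// ---- declaration (would go in the .h file) ----
class Grid3d {
public:
    HD Grid3d(int nx, int ny, int nz);
    // Map a 3-D index to a flat array offset (assumed x-fastest layout).
    HD int flatten(int i, int j, int k) const;
    HD int size() const;
private:
    int nx_, ny_, nz_;
};

// ---- definition (would go in the .cu file) ----
// The same qualifiers appear here as on the declarations above, which is
// what avoids the "redeclared with __host__ __device__" warning.
HD Grid3d::Grid3d(int nx, int ny, int nz) : nx_(nx), ny_(ny), nz_(nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
}

HD int Grid3d::flatten(int i, int j, int k) const {
    return i + nx_ * (j + ny_ * k);
}

HD int Grid3d::size() const {
    return nx_ * ny_ * nz_;
}
```

With the qualifiers matching on both sides, the same flatten can be called from host code and from inside a kernel. If the definition sits in a different .cu file from the kernel that calls it, the objects still need to be built with -dc (or -rdc=true) as described in the article.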

Tangent 1: Probably because it's simpler to explain and diagram 2D implementations.

Tangent 2: I will look at the error. Please don't cross-post.


It was my mistake: __host__ __device__ does work on a single method, and you were right, @Mark_Harris: __host__ __device__ has to appear in BOTH the declaration and the definition.

Thanks for all your help!


I have a question about the program. I changed the number of steps in the main function from 100 to 500000 and the program crashed.

Why does the program crash? Is there a kernel launch limit?

I have a Tesla K20m card.

Thank you.

Hi Albert, there shouldn't be any limit that would cause it to crash under that change. I tried it myself on my laptop and it seems to run fine -- I could only wait for about 50,000 iterations but I see nothing in the code that could cause it to crash after more, other than random bit errors corrupting memory. It took several minutes to run 50,000 iterations on my laptop GPU (a few years old).

Hi Mark, thanks for the reply. I tried the program on another workstation with a different OS and a different GPU, and it seems that the problem is with my Tesla or my OS. I'm investigating this. Thank you!

Hi Mark,

I found the problem: it's a temperature problem... Do you know if it is possible to buy a heat sink for the Tesla K20m?

Thank you.