Compiling / linking CUDA apps?

I hope you guys can excuse my making a total noob post, but I can’t seem to figure out how to compile a program that has a CUDA component (other than the SDK examples, which compiled fine).

I have a very simple .cu file, which I hope will work (I don’t care if it’s efficient at this point, as I just want to see something come out of the GPU). All it does is a simple blur on a fixed-size monochrome image:

__global__ void simpleBlur( float*, float* );

void blur( float* inputImg, float* outputImg ) {
    int imgSize = 1024 * 650 * sizeof( float );
    float *devIpt;
    float *devOpt;

    cudaMalloc((void**)&devIpt, imgSize);
    cudaMalloc((void**)&devOpt, imgSize);
    cudaMemcpy( devIpt, inputImg, imgSize, cudaMemcpyHostToDevice );

    dim3 dimBlock( 9, 9 );
    dim3 dimGrid( 1016, 642 );
    simpleBlur<<<dimGrid, dimBlock>>>( devIpt, devOpt );

    cudaMemcpy( outputImg, devOpt, imgSize, cudaMemcpyDeviceToHost );
}

__global__ void simpleBlur( float* input, float* output ) {
    int blkOffset = blockIdx.x + 1016 * blockIdx.y;
    output[ blkOffset ] += input[ blkOffset + threadIdx.x + 1016 * threadIdx.y ];
}

I can get it to compile into an object file using NVCC, but I’m not sure what includes to use / libraries to link from the main module (the linker gives me a bunch of not-defined errors, so I’m definitely missing something). Trying to insert it in with the rest of the code and compile it all with nvcc gives me some error about exception handling being disabled.

So, can anybody suggest a simple way to access a CUDA kernel from another program, without modifying that program too much (preferably by linking them pre-compiled, or making a shared object)?

I have plugins for the GIMP working fine, so it’s definitely possible to integrate with another code base. To be fair, the GIMP plug-in architecture does make this pretty straightforward.

It doesn’t sound like you’ve got to the point where your code is the obstacle, so can you post more information about the environment you’re trying to integrate into? Is it C or C++? Have you got all of the source, or are there pre-built libraries that you have to link with? Does it require a specific version of the C compiler? etc.

From that error message, I wonder if you’re trying to mix C++ code into a .cu file? Don’t bother unless you feel confident with the edges between the CUDA C++ subset and your existing code. Just try to stay with the existing code structure; circle back later and fix any unbearable ugliness after you have the full, end-to-end use case working.

I think there are a couple of approaches, depending on where the build complexity lies. Can you build the other code/application without the CUDA components, but with a dummy host-side ‘blur’? If the build system (e.g. makefile) for the ‘other’ code isn’t too complex, it may be practical to add all of its compile and link flags, -I include paths, libraries and library paths into the CUDA SDK

In my case, this was the easiest route. I keep device-side code separate, in .cu files (in the style you have posted), and added the includes and libraries for the GIMP plug-in stuff to a makefile based on the one from the CUDA SDK. I just left the ‘native’ plug-in code in its ‘normal’ files, and got compilation and linking working in a simple ‘brute force’ way.

If you’re mixing C++ in, have a look at that SDK example. In my case, it’s all C, but I did get caught by C++ name mangling at one stage in one of my experiments (but I can’t remember what that was, sorry). Anyway, it’s all C or .cu now, with no C++ name mangling.

HTH, but post a bit more detail if you’re still stuck.


PS - I’m off to my Mum’s, with no internet access, so it’s not that I’m ignoring you … :)

PPS - Hopefully more detail of your build environment will help garner more concrete advice while I’m away!

PPPS - Once you get the build working, I think you’ll have issues with the blur kernel: the 81 threads in each block will all be trying to do a += to the same location simultaneously. But don’t worry about that until the build works.
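
To make the concern about the 81 simultaneous += operations concrete, here is a minimal host-side C++ sketch of the usual alternative, a "gather" formulation. It is not the original poster's code: the names boxSum9x9 and kWindow and the row-major layout are assumptions for illustration. Each output pixel is owned by exactly one accumulator that sums its own 9x9 window, so no two workers ever write the same location. In a kernel, the two outer loops would map to one thread per output pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant: side length of the square window (9x9 = 81 samples).
constexpr std::size_t kWindow = 9;

// Hypothetical helper: returns the sum of every full 9x9 window of a
// row-major image. Output size is (width - 8) x (height - 8).
std::vector<float> boxSum9x9(const std::vector<float>& in,
                             std::size_t width, std::size_t height) {
    assert(width >= kWindow && height >= kWindow);
    assert(in.size() == width * height);

    const std::size_t outW = width - kWindow + 1;
    const std::size_t outH = height - kWindow + 1;
    std::vector<float> out(outW * outH);

    for (std::size_t y = 0; y < outH; ++y) {
        for (std::size_t x = 0; x < outW; ++x) {
            float acc = 0.0f;  // private accumulator: one writer per output pixel
            for (std::size_t dy = 0; dy < kWindow; ++dy) {
                for (std::size_t dx = 0; dx < kWindow; ++dx) {
                    acc += in[(y + dy) * width + (x + dx)];
                }
            }
            out[y * outW + x] = acc;  // single store, no read-modify-write race
        }
    }
    return out;
}
```

For a 1024 x 650 input this produces a 1016 x 642 output, which matches the grid dimensions used in the original post. Dividing each sum by 81 would turn it into an average.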

Thanks for the reply!

The code itself is in C, but apparently something that’s getting included somewhere is in C++ (I think the error message was tied to ‘classes.h’).

The “real” program that I want to be able to get a CUDA component into is a computer-vision project that uses SDL for (2D) graphical output, and will also eventually use V4L and the Mplayer libraries for video input. It dynamically links to a bunch of shared objects, and is pretty heavily multi-threaded on the host side (the Core-2 Duo processors love it when you make every task into two little threads).

For some reason, the examples in the SDK look completely different from any C I’ve ever used before (I’m an electrical engineer, not a programmer). Everything in the CUDA manual about parallel processing and host-device separation looks pretty straightforward, but the compiler directives and makefiles always confused me (I have no idea what the @PARAM’s are, for example). This is another reason I’d prefer to avoid using the templates; if I can figure out what to #include so I can just link in the object file that nvcc gives me, then I can stay in familiar territory.

As for the blur kernel, it’s the simplest thing I could think of that my existing code can get a result back from, just to see that the video card did something (thus, the 81 read-writes to DDR memory don’t bother me at this stage). I have some of the “real” applications partly laid-out, but they’re not something I want to be debugging until I have the compilation figured out.


I got past all the undefined reference errors by adding “-lcuda -lcudart”, but now gcc is giving me an undefined reference error for “blur” (my function).

Following the advice in this thread:

I added the extern “C” { … }, so the name wouldn’t get mangled, and when I check it with readelf, the function looks like it’s present under its own name:

  179: 000038be   243 FUNC    GLOBAL DEFAULT    1 blur

Does anyone know why LD is not finding it?

Here’s what I’m using to compile:

g++ -o test2 blurkernel.o cudablur0.cpp -lSDL -lSDLmain -lSDL_image -L/usr/local/cuda/lib -lcuda -lcudart                      

Ok, I have some more information:

I can get a CUDA component to link with another module if (and only if) the main() function is inside the .cu file. If the entry point is in another source file, it won’t link even if I supply NVCC with the .cu file and a .o file compiled in GCC. The function that talks to the G80 doesn’t have to be called from inside that file, so I don’t think it’s being optimized away. Also, this linkage thing is only a problem with objects that were compiled by NVCC–I went and tried the exact same procedure using only GCC (just to see if I was losing my mind), and it worked fine.

Here’s the C++ module:

#include <iostream>

using namespace std;

// Defined in the .cu file.
void fourops( float, float, float* );

int hack() {
    float a, b;
    float c[ 4 ];

    cout << "The two floats?\n";
    cin >> a;
    cin >> b;

    fourops(a, b, c);

    cout << "\n\n" << c[ 0 ] << "  " << c[ 1 ] << "  " << c[ 2 ] << "  " << c[ 3 ] << "\n";
    return 0;
}

And the .cu file:

#include <stdio.h>

__global__ void addtest( float, float, float* );

// Defined in the C++ module.
__host__ int hack( void );

void fourops( float iptA, float iptB, float* target ) {
    float *devOpt;
    //float result[ 4 ];

    cudaMalloc((void**)&devOpt, 4 * sizeof( float ));

    dim3 dimBlock( 1, 1 );
    dim3 dimGrid( 1, 1 );
    addtest<<<dimGrid, dimBlock>>>( iptA, iptB, devOpt );

    cudaMemcpy( target, devOpt, 4 * sizeof(float), cudaMemcpyDeviceToHost );
}

int main( int argc, char* args[] ) {
    hack();
    return 0;
}

__global__ void addtest( float A, float B, float *output ) {
    output[ 0 ] = A + B;
    output[ 1 ] = A - B;
    output[ 2 ] = A * B;
    output[ 3 ] = A / B;
}

I compiled like this:

g++ -c cplusplusmodule.cpp

nvcc -o test4 cplusplusmodule.o

Moving the main() into the other file causes me to get an error from ld telling me that fourops() is not defined, no matter what else I do (yes, I had an extern “C” in there before). Maybe somebody else can try this and see if it’s just me? I’m using an unsupported Linux distro, so maybe that’s the issue?

Check the mangled name of the function in the cplusplusmodule.o

This is for example what is reported by nm on my machine:

$ nm cplusplusmodule.o |grep hack
00000246 t _GLOBAL__I__Z4hackv
000000fc T _Z4hackv

$ nm cudamodule.o |grep hack
U _Z4hackv

As you can see on RHEL4, the names match and your code worked fine.

Which version of g++ are you using?
As a workaround, you could hardcode the mangled name in your .cu file.
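
For reading nm output like the listing above, it can help to decode the mangled names rather than compare them by eye. The following is a minimal sketch, assuming GCC or Clang: the <cxxabi.h> header and abi::__cxa_demangle belong to the Itanium C++ ABI runtime and are not available with every compiler. The helper name demangle is hypothetical.

```cpp
#include <cxxabi.h>

#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the human-readable form of a mangled C++
// symbol, or the input unchanged if demangling fails.
std::string demangle(const std::string& symbol) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with
    // malloc, so it must be released with free.
    char* text = abi::__cxa_demangle(symbol.c_str(), nullptr, nullptr, &status);
    if (status != 0) {
        return symbol;
    }
    assert(text != nullptr);
    std::string result(text);
    std::free(text);
    return result;
}
```

With this, demangle("_Z4hackv") gives "hack()", i.e. a C++ function named hack taking no arguments. On the command line, nm -C or piping through c++filt does the same job.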

Moving the main from the .cu to the .cpp worked too.

nvcc -c cudamodule.cu
g++ cplusplusmodule.cpp cudamodule.o -L/usr/local/cuda/lib -lcudart

$ ./a.out
The two floats?

3 -1 2 0.5

I’m using GCC 4.1.1 and ld version

The OS is PCLINUXOS 2007, with kernel

It’s not the name-mangling, as I originally had an extern “C” { … } around the host function in the .cu file. Checking it with nm gave me back the correct function name with no prefix (again, this is where I was starting to question my sanity). For some reason, ld just won’t see the host functions in an nvcc .o file, unless there’s a main() in it.

I also have the problem when passing two .cu files to nvcc, where one of them has a host function but no main(). Compiling the two files below with “nvcc -o test” fails on this system at the link stage.

void test( void );

int main( int argc, char* args[] ) {
    test();
    return 0;
}


#include <stdio.h>

__global__ void addtest( float, float, float* );

extern "C" {

void test( void ) {
    float *devOpt;
    float result[ 4 ];

    cudaMalloc((void**)&devOpt, 4 * sizeof( float ));

    dim3 dimBlock( 1, 1 );
    dim3 dimGrid( 1, 1 );
    addtest<<<dimGrid, dimBlock>>>( (float)5.145, (float)7.3, devOpt );

    cudaMemcpy( &result, devOpt, 4 * sizeof(float), cudaMemcpyDeviceToHost );

    printf ("5.145 + 7.3: %f \n", result[ 0 ] );
    printf ("5.145 - 7.3: %f \n", result[ 1 ] );
    printf ("5.145 * 7.3: %f \n", result[ 2 ] );
    printf ("5.145 / 7.3: %f \n", result[ 3 ] );
}

}

__global__ void addtest( float A, float B, float *output ) {
    output[ 0 ] = A + B;
    output[ 1 ] = A - B;
    output[ 2 ] = A * B;
    output[ 3 ] = A / B;
}


I also got the blur thing to work using the main() hack (plus some bug fixes), as well as a cufft program, so I’m technically good to go (it just means I need to keep all the CUDA stuff in one huge .cu file, or hard-code them together with #include).

Let me know if there’s any more information I can provide with regard to the linking weirdness.

nvcc is mangling the name of the test function when you compile

nm main.o |grep test

     U _Z4testv

Add an extern “C” around the declaration in the file that contains main(), and it will work (it did for me):

extern "C"
{
void test( void );
}

int main( int argc, char* args[] ) {
    test();
    return 0;
}

5.145 + 7.3: 12.445000
5.145 - 7.3: -2.155000
5.145 * 7.3: 37.558502
5.145 / 7.3: 0.704795
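
One way to keep the declaration and the definition from drifting apart like this is to put the extern "C" guard in a single header that both files include. The sketch below folds that idea into one C++ file so it can be compiled on its own. The header name fourops.h is hypothetical, and the function body is a plain host-side stand-in for the kernel launch, not CUDA code.

```cpp
#include <cassert>

// ---- would live in a shared header, e.g. a hypothetical fourops.h ----
#ifdef __cplusplus
extern "C" {
#endif

// C linkage: the symbol is emitted as plain "fourops", not a mangled name,
// so callers and the definition agree as long as both include this header.
void fourops(float a, float b, float* out);

#ifdef __cplusplus
}
#endif

// ---- would live in the .cu file; a host-only stand-in is used here ----
extern "C" void fourops(float a, float b, float* out) {
    assert(out != nullptr);
    out[0] = a + b;
    out[1] = a - b;
    out[2] = a * b;
    out[3] = a / b;
}
```

The #ifdef __cplusplus guard also lets a plain C source file include the same header, since C compilers do not understand extern "C".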

Aw, crap >.<

I must be freakin’ retarded…

Thanks for the help!

I’m new to this stuff. I’m working on a large C++ project that involves .cpp, .h, and data files stored across several folders. I’m using RedHat Linux with CUDA 2.2.
I tried to include a CUDA function (defined in a .cu file) in my C++ code. I modified the Makefile and got a bazillion errors. Then I read your posts and tried to first generate the object files with nvcc and then link them all together with g++. It didn’t work. I thought it was something in my code, so to rule that out, I took the ‘cppIntegration’ files, generated the object files with nvcc and tried to link them all together with g++. IT DOES NOT WORK!

I read your posts and applied all your suggestions (extern “C” for instance, although that’s done in the ‘cppIntegration’ files).

This is my series of commands:

nvcc -arch sm_13 --host-compilation c++ -I. -I/usr/local/cuda/include -I/home/drl/NVIDIA_CUDA_SDK/common/inc -L/usr/local/cuda/lib -L/home/drl/NVIDIA_CUDA_SDK/lib -L/home/drl/NVIDIA_CUDA_SDK/common/lib/linux -c *.cu

g++ -o svm *.o *.cpp -I. -I/usr/local/cuda/include -I/home/drl/NVIDIA_CUDA_SDK/common/inc -L/usr/local/cuda/lib -L/home/drl/NVIDIA_CUDA_SDK/lib -L/home/drl/NVIDIA_CUDA_SDK/common/lib/linux -lcuda -lcudart -lcutil

And this is the result I got:

cppIntegration.o: In function `kernel2__entry(int2*)':
tmpxft_00005ed7_00000000-14_cppIntegration.ii:(.text+0x1c8): multiple definition of `kernel2__entry(int2*)'
cppIntegration_kernel.o:tmpxft_00005ed7_00000000-22_cppIntegration_kernel.ii:(.text+0x1c8): first defined here
cppIntegration.o: In function `kernel__entry(int*)':
tmpxft_00005ed7_00000000-14_cppIntegration.ii:(.text+0x238): multiple definition of `kernel__entry(int*)'
cppIntegration_kernel.o:tmpxft_00005ed7_00000000-22_cppIntegration_kernel.ii:(.text+0x238): first defined here
cppIntegration.o: In function `__device_stub__Z7kernel2P4int2':
tmpxft_00005ed7_00000000-14_cppIntegration.ii:(.text+0x170): multiple definition of `__device_stub__Z7kernel2P4int2'
cppIntegration_kernel.o:tmpxft_00005ed7_00000000-22_cppIntegration_kernel.ii:(.text+0x170): first defined here
cppIntegration.o: In function `__device_stub__Z6kernelPi':
tmpxft_00005ed7_00000000-14_cppIntegration.ii:(.text+0x1e0): multiple definition of `__device_stub__Z6kernelPi'
cppIntegration_kernel.o:tmpxft_00005ed7_00000000-22_cppIntegration_kernel.ii:(.text+0x1e0): first defined here
/tmp/ccs8H011.o: In function `computeGold':
cppIntegration_gold.cpp:(.text+0x0): multiple definition of `computeGold'
cppIntegration_gold.o:cppIntegration_gold.cpp:(.text+0x0): first defined here
/tmp/ccs8H011.o: In function `computeGold2':
cppIntegration_gold.cpp:(.text+0x40): multiple definition of `computeGold2'
cppIntegration_gold.o:cppIntegration_gold.cpp:(.text+0x40): first defined here
/tmp/cckgma36.o: In function `main':
main.cpp:(.text+0x6c): multiple definition of `main'
main.o:main.cpp:(.text+0x6c): first defined here
collect2: ld returned 1 exit status

This is really driving me crazy. I have tried everything, and it seems that there is a problem with the kernel functions.
What really bugs me is that the cppIntegration project works with the supplied Makefile.
Please, I really need your help.