Multiple definition error on device function in header file

I have a device function that is defined in a header file. The reason it is in a header file is because it is used by a global template kernel, which needs to be in a header file since it is a template kernel.

When this header file is included across 2 or more .cu files, I get a LNK2005 error during linking:

I checked the generated .ptx file and found that fooKernel does not even call getCurThreadIdx(); it has been inlined! If that is the case, why is there a multiple definition error?

I can make the error go away by adding the __forceinline__ qualifier to the device function. But my doubt about the normal inlining of the device function here still remains.
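For reference, the workaround is just a matter of qualifier placement (a sketch; the function body is unchanged from the header below):

```cuda
// Forcing inlining means no standalone copy of the function is
// emitted, so no symbol is duplicated across the .cu files and
// the linker no longer complains.
__device__ __forceinline__ int getCurThreadIdx()
{
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}
```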

Another reason I am concerned is that this would mean the device functions used by a template kernel must always be __forceinline__! That severely limits the amount of functionality that can be achieved inside a template kernel, especially on the Fermi architecture!

Here is sample code to produce the above error on CUDA 3.2 and Visual Studio 2008:

FooDevice.h:

#ifndef FOO_DEVICE_H
#define FOO_DEVICE_H

__device__ int getCurThreadIdx()
{
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}

template< typename T >
__global__ void fooKernel( const T* inArr, int num, T* outArr )
{
    const int threadNum = ( gridDim.x * blockDim.x );

    for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
        outArr[ idx ] = inArr[ idx ];

    return;
}

__global__ void fooKernel2( const int* inArr, int num, int* outArr );

#endif // FOO_DEVICE_H

FooDevice.cu:

#include "FooDevice.h"

// One other kernel that uses getCurThreadIdx()
__global__ void fooKernel2( const int* inArr, int num, int* outArr )
{
    const int threadNum = ( gridDim.x * blockDim.x );

    for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
        outArr[ idx ] = inArr[ idx ];

    return;
}

Main.cu:

#include "FooDevice.h"

int main()
{
    int num        = 10;
    int* dInArr    = NULL;
    int* dOutArr   = NULL;
    const int arrSize = num * sizeof( *dInArr );

    cudaMalloc( &dInArr, arrSize );
    cudaMalloc( &dOutArr, arrSize );

    // Using template kernel
    fooKernel<<< 10, 10 >>>( dInArr, num, dOutArr );

    return 0;
}

Remember that all CUDA definitions must be compiled in the same logical compilation unit. You are not doing that, and that is why you get duplicate symbol errors. Don’t compile FooDevice.cu. Ever. #include it from Main.cu instead. That is the underlying source of the error.
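Concretely, the single-compilation-unit build described above might look like this (a sketch based on the file names in the example; FooDevice.cu is removed from the build and only Main.cu is passed to nvcc):

```cuda
// Main.cu -- the ONLY .cu file handed to nvcc.
// Including the .cu (not just the .h) pulls every __device__ and
// __global__ definition into this one compilation unit, so the
// linker only ever sees a single copy of getCurThreadIdx().
// The include guard in FooDevice.h prevents double definitions.
#include "FooDevice.cu"

int main()
{
    // ... same body as before: allocate, launch fooKernel, etc.
    return 0;
}
```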

As for why you see the error on device functions: it is because they are both inlined and compiled separately for sm_20. The separate version exists to provide function pointer support, which is why the functions appear as named symbols in the resulting cubin ELF output, just like textures and constant and global memory declarations.

Thanks for that reply!

Do you mean that all CUDA code should end up being included in a single compilation unit in a program? Could you point me to some CUDA documentation that explains this? I could not find any information on it.

Also, won’t this make CUDA code management complicated and result in long compilation times since code cannot be broken up into compilation units?

Could you explain what this function pointer support is and why it is needed? Also, why does the compiler still keep this separate version around even though (1) it can clearly see that the separate version is not called anywhere in the compilation unit, and (2) device functions cannot be called from another compilation unit anyway?

To answer the first part: yes, it is something of a limitation, but without a linker for device code there is no alternative. The switch to an ELF fat binary format in CUDA 3 might be the first step towards a device code linker, but that brings a lot of other issues that are beyond the scope of your question. As for complexity, there are some extremely well designed and large CUDA template libraries (like thrust) which show how cleanly and elegantly templates can be used in CUDA if you know what you are doing.

To answer the second question: function pointers are a pretty fundamental feature of C and C++. The whole C++ class model is predicated on function pointers. To get some idea of how they can be used, look at the SDK example in recent toolkit releases. The standalone device functions are present in compiled code to support C++ and function pointers. The fact that they are there should not matter to properly designed and compiled code.
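To illustrate why a non-inlined copy has to exist: taking the address of a __device__ function requires a real symbol for the pointer to refer to. A minimal sketch (requires sm_20 or later; the names here are hypothetical, not from the original code):

```cuda
__device__ int getCurThreadIdx()
{
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}

// A device-side pointer to the function. For this address to mean
// anything, the compiler must keep a standalone, non-inlined copy
// of getCurThreadIdx() around -- even if every direct call to it
// has been inlined.
typedef int (*IdxFunc)();
__device__ IdxFunc pIdxFunc = getCurThreadIdx;

__global__ void callThroughPointer( int* outArr )
{
    // An indirect call through the pointer cannot be inlined.
    outArr[ 0 ] = pIdxFunc();
}
```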

I have code where the kernel call and the kernel definition are in separate compilation units, and it works fine. Do you mean to say that these should be combined into the same compilation unit?

Here is an example of what is working fine for me:

//////////////////////////////////////////////
// Host.cu

#include "Kernel.h"

void doSomeCuda()
{
    doSomeCudaKernel<<< x, y >>>( inArray, outArray );
}

//////////////////////////////////////////////
// Kernel.h

__global__ void doSomeCudaKernel( const int*, int* );

//////////////////////////////////////////////
// Kernel.cu

__global__ void doSomeCudaKernel( const int* inArray, int* outArray )
{
    // Do some kernel work
}

//////////////////////////////////////////////