I’m using CUDA 4.0 and I’m having trouble using device functions the way that I need to use them. I was hired to speed up the processing of part of a large system written in C. There are hundreds of files, but the functions that I need are probably in 12-15 of them. The files come in .h and .c pairs (of course, this is standard in any large system). Now, I was thinking that if I add __device__ in front of the prototype in the .h and in front of the function implementation in the .c (renamed to .cu), then I should be able to call those functions from my __global__ kernel function. I was having a lot of trouble getting this to link, so I wrote an extremely simple vector-add application on Windows 7 with CUDA 4.0. Where normally you would just write c[i] = a[i] + b[i];, I implemented a simple add function with its prototype in cudaUtils.h and its implementation in cudaUtils.cu, marked __device__, etc. So now my kernel function does c[i] = cudaUtils_plus(a[i], b[i]); just to test out the file-structure style.
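To make the setup concrete, here is a minimal sketch of the three files described above. The file and function names come from the post; the body of the add function and the kernel name are my assumptions, since the post only says it is a simple add:

```cuda
// cudaUtils.h -- __device__ function prototype
#ifndef CUDAUTILS_H
#define CUDAUTILS_H
__device__ int cudaUtils_plus(int a, int b);
#endif

// cudaUtils.cu -- __device__ function implementation
#include "cudaUtils.h"
__device__ int cudaUtils_plus(int a, int b)
{
    return a + b;  // assumed body; the post only calls it "a simple add function"
}

// kernel.cu -- kernel calling the device function from another file
#include "cudaUtils.h"
__global__ void vecAdd(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = cudaUtils_plus(a[i], b[i]);  // CUDA 4.0: "External calls are not supported"
}
```

Compiled as separate .cu files under CUDA 4.0, this layout produces the link error quoted below.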

The compiler/linker tells me that:
error : External calls are not supported (found non-inlined call to _Z14cudaUtils_plusii)

Now, if I #include “cudaUtils.cu” instead of “cudaUtils.h”, it links and runs correctly. Or, if I move the implementation from the .cu into the .h, it also works (even without the inline keyword).
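In other words, the following variant of the kernel file links, because the device function’s body becomes textually part of the same compilation unit as the kernel (a sketch, reusing the file names above):

```cuda
// kernel.cu -- workaround: include the implementation, not the header
#include "cudaUtils.cu"  // pulls the __device__ function body into this unit

__global__ void vecAdd(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = cudaUtils_plus(a[i], b[i]);  // now resolvable at compile time
}
```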

I don’t think that this is practical when taking selected functions from a dozen files, where each .cu would end up including one or more other .cu files. I’m sure that I would end up with multiply-defined-symbol complaints.
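One common mitigation (my suggestion, not something from the original posts) is to put include guards in the .cu implementation files themselves, exactly as one would in a header, so that each implementation is pulled into a compilation unit at most once no matter how many other .cu files include it:

```cuda
// cudaUtils.cu -- guard the implementation file like a header
#ifndef CUDAUTILS_CU
#define CUDAUTILS_CU

#include "cudaUtils.h"

__device__ int cudaUtils_plus(int a, int b)
{
    return a + b;
}

#endif  // CUDAUTILS_CU
```

This avoids multiply-defined symbols within a single compilation unit, though it does nothing to lift the one-unit restriction itself.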

Is this truly a limitation of the CUDA compilation system, that you can’t split the implementation of a kernel program across multiple files? I looked through the examples in the SDK, and I couldn’t find one that uses the .h and .cu method that I need. Some have a .cuh file, but typically it just contains a typedef struct that will be used in the kernel, with no __device__ function prototypes.

Any help here?


Yes. CUDA does not have a linker on the device side. You need to include all device code in a single compilation unit.

I don’t think that a dynamic linker on the device is the issue. I’m not trying to produce .o’s, .a’s, or .lib’s for the device to take care of. It looks to me like the only thing that works is having all CUDA device code textually within one source file, either by writing it that way or by pulling the source functions in through an included .h or .cu file.

Is this what you mean? That all CUDA device code has to be basically in a single source file?

Yes. I’m not particularly referring to a dynamic linker - CUDA doesn’t have a static linker for device code either.

So your entire CUDA kernel has to reside in a single .o?


Err… Then how can NVidia supply all those libraries of CUDA functions?

And wouldn’t it be really nice if the NVidia developers got to work and implemented one of the most elementary parts of a software management package?

I’ve not seen Nvidia provide any library of device functions without source code.

The second question I cannot really answer. So far it doesn’t seem to be an important omission to me (although the question pops up regularly on the forums).
CUDA kernels don’t usually get so large that you’d need to split them into different compilation units. And at the present state of the CUDA ecosystem, it seems difficult to me to provide nontrivial “black box” device functions that one could just link into one’s kernels and still get close to optimal performance. But with the improvements in each generation of devices bringing the GPU programming model and performance characteristics closer to those of CPUs, Nvidia seems to be heading there, and at some point in the future we’ll probably get a device-side linker as well.