__global__/__device__ side architecture

hello guys,

I just spent a few weekends hacking with CUDA and have some open questions.

The architecture on the host side is clear: copying data to/from device memory,
starting some kernels, etc. On the host side you are free to use
whatever you want, as long as it compiles with gcc/g++.
I was able to compile my Java program with GNU GCJ and call the kernel
on the device. It’s really interesting to use e.g. the networking
classes from Java to receive data which is then processed by the device kernel.

What I still do not know is what a complex kernel architecture looks like. Right
now, I’m just calling the kernel using the <<<>>> syntax, and the kernel
is currently one file. But with higher complexity there is a need to split
the kernel into more files and also to include some header files.

As I understand it, nvcc is just a front end to gcc (on Linux): it generates
.c files from the .cu input, which are then compiled with gcc. Does
that mean I’m not able to include any header files in the device-side
code?

Assume the following example :

(Host side code) <<<>>> (a.cu) (b.cu) (c.cu)

Some host-side code calls the kernel, which is split across the
files a.cu, b.cu and c.cu. a.cu implements the __global__ kernel function and calls
some __device__ functions. Right now I do not understand how/where the __device__ functions should be
implemented. Does a.cu #include "b.cu" and "c.cu", or should I write a b.h and c.h in which the functions
implemented in b.cu and c.cu are declared? All the functions are inlined, so
should they be declared as __device__ inline because they are implemented in a different
file than the __global__ kernel function itself?

Maybe some of you have already implemented complex or huge kernels and could give
me some ideas about the best approach.

Thanks,
jj

The easiest solution is to put everything in a single file, or to #include everything into a single file.
__device__ seems to be defined as something that includes inline, so a __device__ function won’t work unless its definition is included.
nvcc splits a .cu into a .c part and a .gpu part. After the .gpu part is compiled, the resulting cubin is inserted into the .c file as a constant array. At run time it is handled by generated CUDA runtime API calls. Those calls are hardcoded to launch kernels in their own module, so invoking a kernel across files this way is impossible. Even if the driver API is used, it’s still easier to write all kernels in one file, since the GPU-side compiler chain doesn’t have a linking stage and managing many modules is rather painful.

Not sure if I understand your question correctly…

I currently have a kernel split among 3 .cu files: kernel.cu, file1.cu and file2.cu.
kernel.cu contains the actual kernel code, which uses __device__ functions defined in file1.cu and file2.cu.
file1.cu and file2.cu are simply #include’d from kernel.cu. There’s no need to feed file1.cu and file2.cu to nvcc separately, as there is no device-side linker.
You can split your kernel into many files, but you have to use preprocessor directives to combine them into one big .cu file before feeding it to nvcc.
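
As a rough sketch of that layout: the helper names scale() and offset() and the host wrapper run_transform() are made up for illustration, error checking is left out, and the contents of file1.cu and file2.cu are written inline so the example is a single file. In the real layout those two sections would be replaced by the #include lines shown in the comment.

```cuda
// kernel.cu (sketch): the only file handed to nvcc.
// In the real layout the two helper sections below live in file1.cu and
// file2.cu and are pulled in with:
//   #include "file1.cu"
//   #include "file2.cu"
#include <cuda_runtime.h>

// ---- contents of file1.cu (hypothetical helper) ----
__device__ float scale(float x, float factor) { return x * factor; }

// ---- contents of file2.cu (hypothetical helper) ----
__device__ float offset(float x, float delta) { return x + delta; }

// ---- kernel.cu proper: the __global__ entry point ----
__global__ void transform(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = offset(scale(data[i], 2.0f), 1.0f);
}

// Host-side wrapper (hypothetical): fills an 8-element array with 0..7,
// runs the kernel on it and returns element idx (0 <= idx < 8).
float run_transform(int idx)
{
    const int n = 8;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    float *dev = 0;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    transform<<<1, n>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[idx];
}
```

Only kernel.cu goes on the nvcc command line; the helper files reach the compiler through the preprocessor.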

Hope this helps =)

P.S. __device__ functions are always inlined, so there is no need for an explicit inline specifier.

You CAN split different kernels into multiple different files and compile them with nvcc separately. The problem is that they cannot use the same textures or constant memory (perhaps a few other things, too). So, for multiple cooperating kernels spread across files, it is best to do what has already been suggested: include all .cu files into one “big.cu” and only compile that file with nvcc. It is a bit of a kludge, but is a requirement so that multiple files can access the same texture.
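
To illustrate the constant memory case, here is a minimal sketch with two kernels in one translation unit that read the same __constant__ variable. The names gain, apply_gain, add_gain and run_both are invented for the example, and error checking is omitted.

```cuda
// big.cu (sketch): both kernels end up in the same translation unit,
// so both see the same __constant__ variable.
#include <cuda_runtime.h>

// Shared constant memory (hypothetical name).
__constant__ float gain;

// In the real layout this kernel could come from one #include'd file ...
__global__ void apply_gain(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= gain;
}

// ... and this one from another.
__global__ void add_gain(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += gain;
}

// Host-side wrapper (hypothetical): sets the constant once, runs both
// kernels on {1, 2, 3, 4} and returns the first element.
float run_both(float g)
{
    const int n = 4;
    float host[n] = {1.0f, 2.0f, 3.0f, 4.0f};

    float *dev = 0;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // One upload of the constant serves both kernels.
    cudaMemcpyToSymbol(gain, &g, sizeof(float));

    apply_gain<<<1, n>>>(dev, n);
    add_gain<<<1, n>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[0];
}
```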

Hmm, this big-.cu-file kernel is really a hack, but I’ll have to live with that. So there are
no .h files involved; the .cu files are included directly.

What about dependencies? In my previous example there is a file a.cu which contains the
__global__ kernel function, plus the b.cu and c.cu files. All of the functions there have a __device__ signature, and the inline is added automatically.

If a function in b.cu needs to call another function in c.cu, where should the
declaration go? Should I use a forward declaration, e.g. in the kernel file a.cu?

What about data types which are used on both the device and the host side? Can those types at least be put in an included header?

cheers,
jj

For forward declarations the rules are the same as in traditional C/C++: a function must be defined or declared before it is used in another function.

You can include c.cu before b.cu, or (if there are cross-dependencies) you can add a forward declaration of the function from c.cu at the beginning of b.cu. Actually, nothing stops you from creating files a.h, b.h and c.h with declarations of the functions defined in each file and including them as needed.
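
A small sketch of the forward-declaration variant, again flattened into one file so it is self-contained. The function names square, sum_of_squares, kernel and run_sum_of_squares are made up, and error checking is omitted. The comments mark which file each piece would come from once everything is #include’d into a.cu.

```cuda
#include <cuda_runtime.h>

// ---- c.h (hypothetical): declaration only ----
__device__ int square(int x);

// ---- b.cu: calls square() before its definition has been seen ----
__device__ int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}

// ---- c.cu: the definition, included later in the same translation unit ----
__device__ int square(int x)
{
    return x * x;
}

// ---- a.cu: the __global__ entry point ----
__global__ void kernel(int *out, int a, int b)
{
    *out = sum_of_squares(a, b);
}

// Host-side wrapper (hypothetical): launches one thread and returns its result.
int run_sum_of_squares(int a, int b)
{
    int host = 0;
    int *dev = 0;
    cudaMalloc((void **)&dev, sizeof(int));
    kernel<<<1, 1>>>(dev, a, b);
    cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

The definition still has to appear somewhere in the same combined file; the declaration only fixes the ordering problem.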

Um, sorry, I didn’t understand this question.

thanks guys…