I've just spent some weekends hacking on CUDA and have some open questions.
The architecture on the host side is clear: copying data to/from device memory,
launching kernels, etc. On the host side you are free to use
whatever you want, as long as it is gcc/g++ compilable.
I was able to compile my Java program with GNU GCJ and call the kernel
on the device. It's really interesting to use e.g. the networking
stuff from Java to receive data which is then processed by the device kernel.
What I still do not know is what a complex kernel architecture looks like. Right
now I'm just launching the kernel using the <<<>>> syntax, and the kernel
is currently a single file. But with higher complexity there is a need to split
the kernel across more files and also include some header files.
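To make it concrete, my current single-file setup looks roughly like this (the kernel name, sizes and the doubling operation are just placeholders):

```cuda
// kernel.cu -- everything lives in one file right now
#include <cuda_runtime.h>

// the only kernel: doubles every element in place
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void runOnDevice(float *d_data, int n)
{
    // launched with the <<<grid, block>>> syntax mentioned above
    int block = 256;
    int grid  = (n + block - 1) / block;
    scaleKernel<<<grid, block>>>(d_data, n);
}
```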
I understand that nvcc just wraps gcc (on Linux): it generates .c files from the
.cu input, which are then compiled with gcc. Does
that mean I'm not able to include any header files on the device side?
Assume the following example:
(Host side code) <<<>>> (a.cu) (b.cu) (c.cu)
Some host-side code calls the kernel, which is split across the
files a.cu, b.cu and c.cu. a.cu implements the __global__ kernel function and calls
some __device__ functions. Right now I do not understand how/where the device functions
should be implemented. Does a.cu #include "b/c.cu", or should I define a b/c.h declaring the
functions that are implemented in b/c.cu? All the functions are inlined;
should they be declared as __device__ inline because they are implemented in a different
file than the __global__ kernel function itself?
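To make the question concrete, here is the kind of layout I have in mind (all file and function names are made up for illustration):

```cuda
// ---- b.h -- shared declaration for the device helper? ----
__device__ float transform(float x);   // implemented in b.cu

// ---- b.cu -- device helper in its own file ----
#include "b.h"

__device__ float transform(float x)
{
    return x * x + 1.0f;               // some per-element work
}

// ---- a.cu -- the __global__ kernel calling the helper ----
#include "b.h"                         // or should this be #include "b.cu"?

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = transform(data[i]);  // call into the other file
}
```

So the question is whether nvcc can resolve transform() across translation units like this, or whether a.cu has to pull the implementation in with #include so everything ends up in one file.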
Maybe you guys have already implemented some complex/huge kernels and could give
me some ideas about the best approach.