I wrote a (commercial, with a free demo) library, the Kappa Library, that gives you access to the driver API plus a scheduler and is much easier to use than either the runtime or the driver API. As has been mentioned elsewhere (on the NVIDIA forums and Slashdot), the CUDA APIs require far too many function calls for simple things like allocating and transferring memory.
I would recommend the CUDA runtime API if you are just learning and working through examples, and if you want everything fairly explicitly exposed in one place.
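To see what "everything fairly explicitly exposed" means in practice, here is a minimal runtime API sketch (the kernel and its body are just illustrative) that doubles an array on the GPU; notice how many separate calls are needed just to allocate, copy, launch, and copy back:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void doubleElements( int *a, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2;
}

int main()
{
    const int n = 1024;
    int h_a[n];
    for (int i = 0; i < n; ++i)
        h_a[i] = i;

    int *d_a = 0;
    cudaMalloc( (void **) &d_a, n * sizeof(int) );                    // allocate device memory
    cudaMemcpy( d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice );  // copy host -> device

    doubleElements<<< (n + 255) / 256, 256 >>>( d_a, n );             // launch the kernel

    cudaMemcpy( h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost );  // copy device -> host
    cudaFree( d_a );                                                  // free device memory

    printf( "h_a[10] = %d\n", h_a[10] );                              // prints 20
    return 0;
}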
To compile host code and device code separately, just place the device code in a separate '.cu' file, and make sure to use an
extern "C"
declaration on the device code, like the following:
extern "C"
__global__ void mykernel( int *a, int n )
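For example, a minimal, hypothetical 'mykernel.cu' might look like the following (the kernel body is just illustrative); the extern "C" keeps the symbol name unmangled so it can later be looked up by name:

// mykernel.cu -- device code only, compiled separately from the host code
extern "C"
__global__ void mykernel( int *a, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 1;   // illustrative body; any device code goes here
}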
If you are using the driver API (Kappa will do this for you automatically), you can use the following to compile the file to a '.ptx' file that the driver API can then JIT-compile:
/usr/local/cuda/bin/nvcc -m32 -I. -I/usr/local/cuda/include -O3 -o %s.ptx -ptx %s.cu
(where the %s in the above is replaced with your filename; this compiles for 32-bit CUDA even on a 64-bit host because of the '-m32' option). For speed with the driver API, you will want to use PTX files and JIT compilation, since the device code is then compiled specifically for the GPU it is going to run on (the runtime API does some of this for you, and the Kappa library does all of it for you). See the matrixMulDynlinkJIT example in the NVIDIA SDK for more, but you should probably mmap the PTX file instead of loading it the way the example does. A sketch of the load-and-launch sequence is below.
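Here is a minimal sketch of that load-and-launch sequence with the driver API (error checking omitted; 'mykernel.ptx' assumes the hypothetical mykernel.cu above was compiled with the nvcc command shown); cuModuleLoadDataEx does the JIT compile for the GPU in the current context:

#include <cuda.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    cuInit( 0 );

    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet( &dev, 0 );
    cuCtxCreate( &ctx, 0, dev );

    // Read the PTX into a NUL-terminated buffer (mmap'ing the file, as
    // suggested above, works too; a plain read keeps the sketch short).
    FILE *f = fopen( "mykernel.ptx", "rb" );
    fseek( f, 0, SEEK_END );
    long size = ftell( f );
    fseek( f, 0, SEEK_SET );
    char *ptx = (char *) malloc( size + 1 );
    fread( ptx, 1, size, f );
    ptx[size] = '\0';
    fclose( f );

    // JIT compile the PTX for this GPU and look up the kernel by its
    // (unmangled, thanks to extern "C") name.
    CUmodule   module;
    CUfunction kernel;
    cuModuleLoadDataEx( &module, ptx, 0, NULL, NULL );
    cuModuleGetFunction( &kernel, module, "mykernel" );

    int n = 1024;
    CUdeviceptr d_a;
    cuMemAlloc( &d_a, n * sizeof(int) );

    void *args[] = { &d_a, &n };
    cuLaunchKernel( kernel,
                    (n + 255) / 256, 1, 1,    // grid dimensions
                    256, 1, 1,                // block dimensions
                    0, NULL, args, NULL );    // shared mem, stream, params
    cuCtxSynchronize();

    cuMemFree( d_a );
    cuModuleUnload( module );
    cuCtxDestroy( ctx );
    free( ptx );
    return 0;
}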
The host code is compiled with normal g++ compilation; the CUDA driver API is just another Linux shared library, '-lcuda', with headers usually picked up via '-I/usr/local/cuda/include' (note that the shared library is installed by the NVIDIA display driver, while the development headers come from the CUDA toolkit).
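For example (file names here are just placeholders):

g++ -I/usr/local/cuda/include -c hostcode.cpp -o hostcode.o
g++ hostcode.o -lcuda -o myprogram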
If you want your computations to be parallel at a level higher than individual algorithm steps (i.e. you want to build libraries upon libraries that stay efficiently parallel through all the layers), then neither the CUDA driver API nor the CUDA runtime API (nor OpenCL or DirectCompute) is very good. One example of this for CUDA: it is generally not possible to get the Fermi concurrent kernel execution feature across all the CUDA kernels in a program just by using the CUDA APIs. MPI (the Message Passing Interface) gives you parallel computation at the cluster level, and the Kappa Library gives you this at the library component level. If somebody knows of something other than MPI or Kappa that does this and is available for general use, I would be interested to hear about it.
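To illustrate the concurrent kernel point: with the CUDA APIs, kernels only run concurrently if each launch is explicitly placed into its own non-default stream, which is exactly the kind of per-call coordination that is hard to arrange when the kernels are buried inside independently written libraries. A minimal runtime API sketch (kernel names and bodies are just illustrative):

#include <cuda_runtime.h>

__global__ void kernelA( float *x, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernelB( float *y, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

void launchConcurrently( float *d_x, float *d_y, int n )
{
    cudaStream_t s1, s2;
    cudaStreamCreate( &s1 );
    cudaStreamCreate( &s2 );

    // On Fermi and later these two kernels may overlap only because each
    // launch names its own stream; kernels launched into the default
    // stream by separate libraries simply serialize.
    kernelA<<< (n + 255) / 256, 256, 0, s1 >>>( d_x, n );
    kernelB<<< (n + 255) / 256, 256, 0, s2 >>>( d_y, n );

    cudaStreamSynchronize( s1 );
    cudaStreamSynchronize( s2 );
    cudaStreamDestroy( s1 );
    cudaStreamDestroy( s2 );
}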