Should I program with Driver API? newbie here

I have a 470 running on 64-bit Ubuntu, and I am planning to write some computationally intensive programs for internal use. The Programming Guide says I can make my program faster if I compile my device code in 32-bit mode and my host code in 64-bit mode, and that this can only be done with the driver API. Is it true that I am better off programming with the driver API?

If so, how do I compile the host code and device code separately? Chapter 3.1 does not make this clear to me. Can someone show me an example? Thanks a lot!

If you are just beginning to learn CUDA, I’d suggest starting with the runtime API (“CUDA C”).

If you find that your kernels are register-constrained and that 32-bit code could help, you can still switch to the driver API later. By then it will help to know that your kernel and data-copying logic already work.
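To give the original poster an idea of what “starting with the runtime API” looks like, here is a minimal sketch of a complete runtime-API program (my own illustration, not from any SDK sample; kernel and variable names are invented, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Trivial kernel: multiply each element by a scalar.
__global__ void scale(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main()
{
    const int n = 1024;
    float h_a[n];
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    // Allocate device memory, copy up, launch, copy back.
    float *d_a;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    printf("h_a[0] = %f\n", h_a[0]);
    return 0;
}
```

Note how the runtime API hides the context setup and module loading that the driver API makes explicit; the kernel and the memory-copy logic carry over unchanged if you later move to the driver API.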

I wrote a commercial library (a free demo is available), the Kappa Library, that gives you access to the driver API plus a scheduler, but which is much easier to use than the runtime API (or the driver API). As has been mentioned elsewhere (the NVIDIA forums and Slashdot), the CUDA APIs require far too many function calls for simple things like allocating and transferring memory.

I would recommend the CUDA runtime API if you are just learning and working through examples, and if you need everything fairly explicitly exposed in one place.

To compile host code and device code separately, just place the device code in a separate '.cu' file, and make sure to use an extern "C" declaration on the device code, like the following:

extern "C"
__global__ void mykernel( int *a, int n )

If you are using the driver API (Kappa will do this for you automatically), you can use the following to compile the file to a ‘.ptx’ file that can be JIT compiled using the driver API:

/usr/local/cuda/bin/nvcc -m32 -I. -I/usr/local/cuda/include -O3 -o %s.ptx -ptx %s.cu

(where the %s above is replaced with your filename; this compiles for 32-bit CUDA even on a 64-bit host because of the '-m32' option). For speed with the driver API, you will want to use PTX files and JIT, since this compiles the device code specifically for the GPU it is going to run on (the runtime API does some of this for you, and the Kappa library does all of it for you). See the matrixMulDynlinkJIT example in the NVIDIA SDK for more (though you should probably mmap the PTX file instead of loading it the way the example does).
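Since I suggested mmap'ing the PTX file, here is a rough sketch of that JIT path (my own illustration, not the SDK's code; it assumes a CUDA context is already current, and error checking is omitted):

```cuda
#include <cuda.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Load a kernel from a PTX file, JIT-compiling it for the current GPU.
CUfunction load_kernel(const char *ptx_path, const char *name)
{
    // mmap the PTX file instead of reading it into a buffer.
    int fd = open(ptx_path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void *img = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    // NB: cuModuleLoadData expects NUL-terminated PTX text. On Linux the
    // bytes past EOF in the last mapped page read as zero, which NUL-terminates
    // the text unless the file size is an exact page multiple; in that case
    // copy the PTX into a buffer with a trailing '\0' instead.

    // This call JIT-compiles the PTX for the GPU of the current context.
    CUmodule mod;
    CUfunction fn;
    cuModuleLoadData(&mod, img);
    cuModuleGetFunction(&fn, mod, name);

    // The image is copied during load, so the mapping can be released.
    munmap(img, st.st_size);
    close(fd);
    return fn;
}
```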

The host code is compiled using normal g++ compilation; the CUDA driver API is just another Linux shared library, '-lcuda', with include files usually picked up via '-I/usr/local/cuda/include' (note that the shared library is installed by the NVIDIA display driver, while the development headers are installed by the CUDA toolkit).
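Concretely, assuming a default toolkit install and hypothetical file names ('mykernel.cu' for the device code, 'main.cpp' for the host code), the two-part build might look like:

```shell
# device code -> 32-bit PTX, even on a 64-bit host
/usr/local/cuda/bin/nvcc -m32 -I. -I/usr/local/cuda/include -O3 -o mykernel.ptx -ptx mykernel.cu

# host code -> ordinary 64-bit object, linked against the driver library
g++ -I/usr/local/cuda/include -c main.cpp -o main.o
g++ main.o -lcuda -o myprog
```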

If you wish for your computations to be parallel at a level higher than individual algorithm steps (i.e. so you can build libraries upon libraries that stay efficiently parallel throughout the layers), then neither the CUDA driver API nor the CUDA runtime API (nor OpenCL or DirectCompute) is very good. One example for CUDA: it is not generally possible to use the Fermi concurrent kernel execution feature across all the CUDA kernels in a program just by using the CUDA APIs. MPI (the Message Passing Interface) gives you parallel computation at the cluster level, and the Kappa Library gives it to you at the library-component level. If somebody knows of something other than MPI or Kappa that does this and is available for general use, I would be interested to hear about it.

er, why do you think the driver API requires too many calls to allocate or transfer memory? the transfer primitives are the same between driver and runtime.

(now if you had said “too many calls to launch a kernel,” you’d be 100% accurate)

Thanks a lot for your reply. I think I will write some toy programs with the runtime API first, then.

I don’t really get what you’re trying to say here either. Can you elaborate?

Put most simply: since the CUDA APIs do not have any host-side scheduler, the only way to ensure correct program behavior is to put synchronization points between the kernel launches of your algorithm steps (or, automagically, between OpenMP parallel regions). This causes parallel execution to stop at those synchronization points, always on the host and sometimes on the GPU. You no longer have a parallel execution library; you have a library or program with spots that are parallel. If you do not understand that it does not need to be this way, then you have not really looked at the Kappa library (as at least one example of how it does not need to be this way).
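To make the pattern I am describing concrete, here is a runtime-API sketch (kernel names invented for illustration) of two independent algorithm steps separated by the kind of global synchronization point that, without a host-side scheduler, you end up inserting to be safe:

```cuda
#include <cuda_runtime.h>

__global__ void step1(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void step2(float *y) { y[threadIdx.x] *= 2.0f; }

int main()
{
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, 256 * sizeof(float));
    cudaMalloc((void**)&d_y, 256 * sizeof(float));

    // step1 and step2 touch independent data and could overlap,
    // but the conservative pattern puts a global barrier between them.
    step1<<<1, 256>>>(d_x);
    cudaThreadSynchronize();   // host (and GPU pipeline) stalls here
    step2<<<1, 256>>>(d_y);
    cudaThreadSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

The barriers make the program correct regardless of what the steps actually depend on, but they also serialize work that did not need to be serialized; discovering and exploiting the real dependencies is exactly the scheduling job the CUDA APIs leave to you.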

Beyond this, however, in order to build up layers of libraries, resources need to be able to be passed between independent components of independent libraries, and library functions need to be able to be composed from other library functions without knowing (all) the details of how those functions are implemented. A simple example of where this does not currently work well with just the CUDA APIs is data flow: is the data where you need it, and what do you have to know in order to get it there? Another example is kernel launch sizing: what is the GPU usage state of the library components that you did not write? Is some small kernel continually blocking your algorithm? If the code you are using is all yours, then these are not (major) problems: look at the code and change it. But if it is a library, do you even have the code, and why should you have to read the code of somebody else's library?

NVIDIA is making good progress towards this on the GPU side, but (maybe appropriately) has not offered much, if anything, on the host side besides the recent interoperability of the NVIDIA libraries such as the driver/runtime and CUBLAS and CUFFT. (That last statement may sound like I do not appreciate what NVIDIA is doing here; the exact opposite is true.) The reason I say it may not be appropriate for NVIDIA to offer this on the host side is that a true solution for developers of parallel libraries would also embrace OpenMP and other CPU parallel techniques, so that those can be used if the library algorithm steps demand it. (There are probably at least a few algorithm steps, somewhere, that benefit from running on a CPU instead of a GPU :whistling: .) If NVIDIA has now decided that they are, in a major way, a software company and not just a hardware company, then it may be appropriate for them to offer this.

Actually, I do not think the driver API requires any more calls than the runtime; it is just my opinion (and a lot of other people's opinion) that the runtime API, as the higher-level API, should do it with fewer. That is a major reason people give for backing away from getting into CUDA development.

I found that I needed to install libc6-dev-i386 to get device code compiled in 32-bit mode:

sudo apt-get install libc6-dev-i386

I finally got vectorAdd to compile with 32-bit device code and 64-bit host code. I needed to divide the pointer sizes by 2 in vectorAdd.cpp, since device pointers are 4 bytes in the 32-bit device code while host pointers are 8 bytes:

// Device pointers are 4 bytes in the 32-bit device code, but host pointers
// (and hence sizeof(ptr)) are 8 bytes, so the size and alignment passed to
// cuParamSetv are halved. Passing &ptr with size sizeof(ptr)/2 works on
// little-endian x86 because the low 4 bytes hold the device address.
void* ptr;
ptr = (void*)(size_t)d_A;
ALIGN_UP(offset, __alignof(ptr)/2);
error = cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr)/2);
if (error != CUDA_SUCCESS) Cleanup(false);
offset += (sizeof(ptr)/2);
ptr = (void*)(size_t)d_B;
ALIGN_UP(offset, __alignof(ptr)/2);
error = cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr)/2);
if (error != CUDA_SUCCESS) Cleanup(false);
offset += (sizeof(ptr)/2);
ptr = (void*)(size_t)d_C;
ALIGN_UP(offset, __alignof(ptr)/2);
error = cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr)/2);
if (error != CUDA_SUCCESS) Cleanup(false);
offset += (sizeof(ptr)/2);