printf() in a kernel: the manual says yes

I would like to use printf() in some kernels to diagnose problems copying vectors to and from device memory. Simply calling printf() produces a compiler error saying that calling a host function from a __device__/__global__ function is not allowed. Yet the CUDA C Programming Guide Version 3.2 has an example of a kernel using printf() on pp. 121-122, along with a discussion of the various considerations, and I don’t understand what else is necessary to make the example work. Could someone spell it out for me? Is there an example somewhere in the SDK?

Hardware and software as in the sig.

ETA: I see there are indeed a couple of projects in the SDK concerning kernel printing.
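
For anyone who lands here with the same error, here is a minimal sketch of what appears to be needed, assuming the card is compute capability 2.0 or higher (device-side printf() is not available below that). As far as I can tell the two missing pieces are building for that architecture, e.g. nvcc -arch=sm_20 on a toolkit of this vintage (newer toolkits need a newer -arch value), and synchronizing on the host after the launch so the device-side output buffer gets flushed. The names dumpVector and dumpOnDevice are hypothetical, made up for this illustration, and are not taken from the Guide or the SDK.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical diagnostic kernel: each in-range thread prints the element it reads.
// Device-side printf() assumes compute capability 2.0 or higher.
__global__ void dumpVector(const float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        printf("thread %d: v[%d] = %f\n", i, i, v[i]);
    }
}

// Hypothetical host helper: copy n floats to the device, launch the kernel,
// and wait for it so the buffered printf() output is flushed to stdout.
cudaError_t dumpOnDevice(const float *host, int n)
{
    float *dev = NULL;
    cudaError_t err = cudaMalloc((void **)&dev, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        dumpVector<<<blocks, threads>>>(dev, n);
        err = cudaGetLastError();          // catches launch failures
    }
    if (err == cudaSuccess) {
        // On CUDA 3.2 the equivalent call is cudaThreadSynchronize().
        err = cudaDeviceSynchronize();
    }
    cudaFree(dev);
    return err;
}

int main()
{
    const int n = 8;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.5f * i;

    cudaError_t err = dumpOnDevice(h, n);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The order in which the threads' lines appear is not something to rely on, and the output buffer has a fixed size, so for large vectors it is probably worth printing only from a few selected threads.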