printf() in a kernel: the manual says yes

I would like to use printf() in some kernels to diagnose problems copying vectors to and from device memory. Simply calling printf() yields a compiler error insisting that calling a host function from a __device__/__global__ function is not allowed. Yet the CUDA C Programming Guide, Version 3.2, has an example of a kernel using printf() on pp. 121-122, along with a discussion of the various considerations. I don't understand what else is necessary to make that example work. Could someone spell it out for me? Is there an example somewhere in the SDK?

ETA: I see there are indeed a couple of projects in the SDK concerning kernel printing.
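
For anyone who lands here with the same error, this is a minimal sketch of what the guide's example seems to need. It rests on assumptions I have not confirmed on every setup: the device has compute capability 2.0 or higher, and the file is compiled for that architecture (for example `nvcc -arch=sm_20 hello.cu` on a toolkit of that era). If the build targets an older architecture, printf() is treated as a host-only function, which would explain the error above. The kernel name `helloKernel` and the file name `hello.cu` are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical diagnostic kernel: each thread prints its index and a value.
// Device-side printf() assumes a device of compute capability 2.0 or higher.
__global__ void helloKernel(float value)
{
    printf("thread %d: value = %f\n", threadIdx.x, value);
}

int main()
{
    // Launch one block of five threads.
    helloKernel<<<1, 5>>>(1.2345f);

    // Device printf() output is buffered; synchronizing with the device
    // flushes it to the host. On toolkits older than 4.0 the equivalent
    // call is cudaThreadSynchronize().
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If nothing is printed, check that the synchronization call is actually reached. Output from the kernel only appears once the host synchronizes with the device or the context is torn down.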