So, anybody who’s ever used printf() to debug GPU kernels must know these frustrations:
- If you print something, then print again, the lines won’t appear together, since other threads’ printf() output will likely come in between;
- Which means that you must combine all of your printing into a single instruction;
- But you can’t do that for a variable-size structure;
- … and you pine for an sprintf() (or a C++-style stringstream).
And that’s not all: What if you want to write a printf wrapper which, say, identifies the current thread? You can write a varargs function in CUDA… but unfortunately, there is no vprintf which you can call inside your wrapper. So, you’re stuck with writing a macro. Blech :-(
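For reference, here is a minimal sketch of that macro workaround using only the stock device-side printf(). The names THREAD_PRINTF, demo_kernel and launch_demo are made up for illustration and are not part of CUDA or cuda-kat. The `##__VA_ARGS__` form is a compiler extension (accepted by the usual GCC and Clang host compilers), not standard C++, and the format must be a string literal for the concatenation to work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper macro: prefixes the message with the block and thread
// index. Prefix and message go out in ONE printf() call, so output from other
// threads cannot land between them.
#define THREAD_PRINTF(fmt, ...) \
    printf("[block %u, thread %u] " fmt, blockIdx.x, threadIdx.x, ##__VA_ARGS__)

__global__ void demo_kernel()
{
    unsigned global_index = blockIdx.x * blockDim.x + threadIdx.x;
    THREAD_PRINTF("global index = %u\n", global_index);
}

// Launches 2 blocks of 2 threads and waits for the kernel to finish;
// the synchronization also flushes the device-side printf buffer.
cudaError_t launch_demo()
{
    demo_kernel<<<2, 2>>>();
    cudaError_t status = cudaGetLastError();
    if (status != cudaSuccess) { return status; }
    return cudaDeviceSynchronize();
}
```

Calling launch_demo() from host code should print four lines, one per thread, in no guaranteed order. Being a macro, this can’t be passed around, overloaded or given a proper namespace, which is exactly the annoyance.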
Finally, maybe you want to flex your printf muscles: printf("%.*s\n", my_length, my_string), for example, or printf("%zu\n", my_size). Tough cookies, that’s not supported. Not to mention extra features outside of ISO C, like the super-useful support for printing in binary.
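With the stock device printf(), the usual way around this is to cast and format by hand. A rough sketch, under a few assumptions: to_binary and print_workarounds are hypothetical names (not from CUDA or cuda-kat), unsigned int is taken to be 32 bits wide, and %s is expected to work with a thread-local buffer, which is worth verifying on your own toolkit version.

```cuda
#include <cstdio>
#include <cstddef>

// Hypothetical helper: writes the 32 binary digits of `value`, most
// significant bit first, followed by a terminating '\0'.
// `out` must have room for at least 33 chars.
__host__ __device__ inline void to_binary(unsigned int value, char* out)
{
    for (int bit = 31; bit >= 0; --bit) {
        *out++ = ((value >> bit) & 1u) ? '1' : '0';
    }
    *out = '\0';
}

__global__ void print_workarounds(size_t my_size, unsigned int flags)
{
    char bits[33];
    to_binary(flags, bits);
    // No 'z' length modifier: cast the size_t and use %llu instead.
    // No binary conversion: print the hand-built digit string with %s.
    // Everything goes into a single printf() call to keep the line intact.
    printf("size = %llu, flags = 0b%s\n", (unsigned long long) my_size, bits);
}
```

A launch such as print_workarounds<<<1, 2>>>(sizeof(double), 5u), followed by cudaDeviceSynchronize(), would exercise it. It works for fixed-size values, but it is the kind of boilerplate a proper printf family should make unnecessary.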
It’s weird that CUDA has been around for, what, 13 years now, and nobody’s offered this (AFAICT). So, that period is now - almost - over. I’ve recently pushed an implementation of most of the printf() family of functions to the development branch of my cuda-kat library.
In a way, this is pretty mature code: It’s a port of this stand-alone printf library for embedded systems, so it’s inherited a rather extensive set of unit tests. But even though these tests now pass when run in GPU kernels, that doesn’t test the behavior in a massively parallel environment.
So: I need some beta testers to try this out. If you’re doing some kernel development work, and occasionally debug-print stuff… please consider giving it a spin.
Bugs/suggestions can obviously be filed either on the cuda-kat issue page or here.