Guidance regarding output processing

I wouldn’t recommend printing from device code except for debugging purposes. Why not simply copy the buffer from the device to the host first? You can apply whatever post-processing is desired or required after that.
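
To make that concrete, here is a minimal sketch of the copy-back approach. It assumes CUDA, since the thread doesn't say which API you are using. The kernel `fill_squares`, the helper `check`, and the buffer size are all made up for illustration, so substitute your own kernel and data:

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its result in a buffer
// instead of calling printf from device code.
__global__ void fill_squares(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = i * i;
    }
}

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main()
{
    const int n = 8;  // arbitrary size chosen for the example

    int *d_out = nullptr;
    check(cudaMalloc(&d_out, n * sizeof(int)), "cudaMalloc");

    fill_squares<<<1, 32>>>(d_out, n);
    check(cudaGetLastError(), "kernel launch");

    // Copy the results to the host. A blocking cudaMemcpy issued after the
    // kernel on the default stream does not start until the kernel is done.
    std::vector<int> h_out(n);
    check(cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                     cudaMemcpyDeviceToHost), "cudaMemcpy");
    check(cudaFree(d_out), "cudaFree");

    // Post-process on the host: print here, or write to a file with
    // fopen/fprintf, sort, filter, and so on.
    for (int i = 0; i < n; ++i) {
        std::printf("%d: %d\n", i, h_out[i]);
    }
    return 0;
}
```

Once the data is in `h_out` it is ordinary host memory, so the output order and formatting are entirely under your control, which is not the case with printf called from many threads on the device.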

FWIW, I recently posted a minimal example of printing from device code into a file here: