Effective Parallelisation of CUDA C code

I’m not sure why that would be. I generally don’t have trouble using printf in-kernel.

Before entering what I would call “typical debug”, I would ensure that:

  1. I am doing rigorous error checking, and that no errors are being reported.
  2. My code reports no errors when run with compute-sanitizer or cuda-memcheck (one or the other, depending on your GPU type). If errors are reported at this step, I would probably use the method described here to localize those errors, in an effort to sort them out.
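As a rough illustration of what "rigorous error checking" in step 1 can look like, here is a minimal sketch. The macro name `CUDA_CHECK`, the kernel `scale_kernel`, and the driver `run_checked_scale` are hypothetical names chosen for this example, not anything from the code under discussion. The sketch assumes `n` is positive and checks every runtime API call, plus the kernel launch itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (an assumption for this sketch): wraps a CUDA
// runtime call and aborts with file/line information if it returns an error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Illustrative kernel: doubles each element, guarded by a bounds check.
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical driver (assumes n > 0): returns the number of elements whose
// value is not what the kernel should have produced, or -1 if the host
// allocation fails.
int run_checked_scale(int n)
{
    float *h = (float *)malloc(n * sizeof(float));
    if (!h) return -1;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = NULL;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d, n);
    CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution

    CUDA_CHECK(cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d));

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * i) ++bad;
    free(h);
    return bad;
}
```

For step 2, the resulting executable can be run under the tool, for example `compute-sanitizer ./my_app`, where `my_app` is a placeholder for whatever your binary is called.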

After completing those steps successfully, I don’t think you’ll have any trouble using in-kernel printf to proceed with “typical debug”.
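For reference, a minimal sketch of in-kernel printf used for this kind of debugging might look like the following. The kernel `debug_kernel` and the driver `run_printf_debug` are hypothetical names for this example, and the sketch assumes `n` is positive. Device-side printf output is buffered and flushed at synchronization points, so the host synchronizes after the launch. Printing is limited to a few threads to keep the output readable; the order of lines from different threads is not guaranteed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: adds 1 to each input and prints a few values.
__global__ void debug_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] + 1;
        // Restrict printing to the first few threads to avoid flooding
        // the output buffer.
        if (i < 4)
            printf("block %d thread %d: in=%d out=%d\n",
                   blockIdx.x, threadIdx.x, in[i], out[i]);
    }
}

// Hypothetical driver (assumes n > 0): returns 0 if all CUDA calls succeed
// and the results match, 1 otherwise.
int run_printf_debug(int n)
{
    int *h = (int *)malloc(n * sizeof(int));
    if (!h) return 1;
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d_in = NULL, *d_out = NULL;
    int status = 1;

    if (cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&d_out, n * sizeof(int)) == cudaSuccess &&
        cudaMemcpy(d_in, h, n * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        debug_kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
        // Synchronizing both surfaces kernel errors and flushes the
        // device-side printf buffer to stdout.
        if (cudaGetLastError() == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess &&
            cudaMemcpy(h, d_out, n * sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            status = 0;
            for (int i = 0; i < n; ++i)
                if (h[i] != i + 1) status = 1;
        }
    }

    cudaFree(d_in);   // cudaFree on a null pointer is a no-op
    cudaFree(d_out);
    free(h);
    return status;
}
```

If a kernel like this prints nothing, a missing synchronization or an earlier unreported error is a common cause, which is why the two checks above come first.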

A tutorial on basic CUDA debugging is given in session 12 of the online training series that I mentioned in my post on November 13th in this thread.