CUDA: C++ vs C


Is there any significant difference in performance between a CUDA kernel or program written in C++ and one written in C?

It is not at all clear what you are asking. For device code, CUDA is basically a subset of C++ with some minimal extensions. Host code gets passed to the host compiler (gcc, clang, or MSVC depending on platform).

However, nvcc parses and pre-processes host code before passing it to the host compiler, and this can lead to performance differences in the host code compared with compiling the same code directly with the host compiler. As far as I can tell, these differences are artifacts of the host compiler's code generation, i.e. code that is functionally identical but expressed slightly differently causes different machine code to be generated. I have, in the past, observed performance degradations of up to 10% in host code when compiled via nvcc, but I haven't looked into the issue for a number of years.

My general advice for any non-trivial application would be to keep nearly all host code in .cpp files that are compiled directly by the host compiler, and to put in .cu files only the host code that is needed to interface with the device.
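
As a rough illustration of that split, here is a minimal sketch. The file names (scale_device.h, scale_device.cu, app.cpp) and the functions scale_on_device, scale_kernel, and scaled_copy are all hypothetical, made up for this example. The three files are shown in one listing for brevity, and the `#else` branch is only a CPU stand-in so the listing also builds with a plain host compiler. CUDA error checking is omitted to keep it short.

```cpp
// ---- scale_device.h (hypothetical): plain C++ interface, no CUDA types ----
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies data[0..n) by factor in place. Implemented in the .cu file.
void scale_on_device(float* data, std::size_t n, float factor);

// ---- scale_device.cu (hypothetical): the only file compiled by nvcc ----
#ifdef __CUDACC__
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, std::size_t n, float factor) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Thin wrapper: copy in, launch, copy out. Nothing else lives here.
void scale_on_device(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    if (n == 0) return;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, data, n * sizeof(float), cudaMemcpyHostToDevice);
    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);
    scale_kernel<<<blocks, threads>>>(d, n, factor);
    cudaMemcpy(data, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
#else
// CPU stand-in (an assumption of this sketch) used when nvcc is not involved.
void scale_on_device(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}
#endif

// ---- app.cpp (hypothetical): ordinary host code, built by gcc/clang/MSVC ----
// It sees only the declaration above, so nvcc never touches this code.
std::vector<float> scaled_copy(const std::vector<float>& in, float factor) {
    std::vector<float> out = in;
    scale_on_device(out.data(), out.size(), factor);
    return out;
}
```

In a real project the header would contain only the declaration, the .cu file would be built with nvcc, app.cpp with the host compiler, and the resulting object files linked together. The performance-sensitive host code then never passes through nvcc's front end.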