C++ example of invoking the same thread across multiple cores with different data sets

Can anyone point me to a tutorial or examples showing how I can port my existing C/C++ threads to CUDA cores? Also, is there a specification for how big the program code (not the data) should be for optimal porting to CUDA?

Most of the examples I have found show how to distribute the processing of one big data set, rather than how to distribute threads across cores. Sorry if I'm not making sense.
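To make the question more concrete, here is a minimal sketch of the CPU-side pattern I mean: the same thread function launched once per data set. It is an illustration only, not my real code. The names `sum_worker` and `run_on_all` are hypothetical, the summing is just a stand-in workload, and it assumes C++11 `std::thread`.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical worker: every thread runs this same code,
// only the data set it receives is different.
void sum_worker(const std::vector<int>& data, long long& result) {
    result = std::accumulate(data.begin(), data.end(), 0LL);
}

// Hypothetical launcher: starts one thread per data set and
// collects one result per thread.
std::vector<long long> run_on_all(const std::vector<std::vector<int>>& data_sets) {
    std::vector<long long> results(data_sets.size(), 0);
    std::vector<std::thread> threads;
    threads.reserve(data_sets.size());

    for (std::size_t i = 0; i < data_sets.size(); ++i) {
        // std::cref / std::ref pass references instead of copies;
        // each thread writes only to its own results[i] slot.
        threads.emplace_back(sum_worker,
                             std::cref(data_sets[i]),
                             std::ref(results[i]));
    }

    // Wait for every thread before reading the results.
    for (auto& t : threads) {
        t.join();
    }
    return results;
}
```

Calling `run_on_all` with three data sets starts three threads that all execute `sum_worker`. What I am looking for is the CUDA equivalent of this, where each core runs the same thread code on its own data set.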

Thanks in advance.