I have some existing serial C/C++ code with many subroutines and classes already written. Now I want to launch 1,000 kernel threads and have each of them run exactly this code with different parameters. The threads are completely independent. What is the best way of doing this?
As an oversimplified example: say I have a foo(int x) written 10 years ago in C/C++, and now I want to call it inside a CUDA kernel as foo(threadIdx.x), so that each thread computes foo() in an embarrassingly parallel fashion.
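For concreteness, here is roughly what I imagine doing (the body of foo() is just a placeholder standing in for my real legacy code, and the launch parameters are made up):

```cuda
#include <cuda_runtime.h>

// Placeholder for my existing serial function, recompiled with __device__.
__device__ int foo(int x) {
    return x * x;  // stand-in for the real 10-year-old logic
}

// Each thread independently runs foo() on its own index.
__global__ void run_foo(int *out) {
    out[threadIdx.x] = foo(threadIdx.x);
}

int main() {
    const int N = 1000;
    int *d_out;
    cudaMalloc(&d_out, N * sizeof(int));
    run_foo<<<1, N>>>(d_out);   // one block of 1,000 threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Is this the right general shape, or is there a better approach?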
If foo(int x) were a single simple function, I figure I could just redefine it in kernel.cu with the __device__ qualifier. However, my existing C/C++ code is rather large, spanning more than 30 .cpp files and involving various file accesses. Do I have to rewrite everything to run inside the kernel? If so, could it be as simple as adding the __device__ qualifier to all of the subroutines?
Any input will be great! Thanks in advance.