kernel calling external C/C++ functions

Hi,

I have some existing serial C/C++ code with many subroutines and classes already written. Now I want to launch 1,000 kernel threads and have each of them run exactly this code with different parameters. The threads are completely independent. What is the best way of doing this?

For an oversimplified example: say I have a function foo(int x) written 10 years ago in C/C++, and now I want to run it inside a CUDA kernel as foo(threadIdx.x), so that each thread computes foo() embarrassingly in parallel.

If foo(int x) were just a single simple function, I figure I could redefine it in kernel.cu with the __device__ qualifier. However, my existing C/C++ code is rather large: more than 30 .cpp files, with various file accesses. Do I have to rewrite everything into the kernel? If so, could it be as simple as adding the __device__ qualifier to all of the subroutines?
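For the single-function case, a minimal sketch of what you describe might look like the following (the body of foo and the launch configuration are made up for illustration; only the qualifier mechanics are the point):

```cuda
#include <cstdio>

// A pre-existing serial function, recompiled for the device.
// __device__ means it is callable only from GPU code; marking it
// __host__ __device__ would let the same source build for both sides.
__device__ int foo(int x) {
    return x * x + 1;  // stand-in for the real computation
}

// Each thread runs foo() on its own index, fully independently.
__global__ void run_foo(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = foo(i);
    }
}

int main() {
    const int n = 1000;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    run_foo<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();

    int h_out[n];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("foo(3) = %d\n", h_out[3]);
    cudaFree(d_out);
    return 0;
}
```

Note that this only works when foo and everything it transitively calls can be compiled as device code; anything doing file I/O or other host-only operations cannot simply be qualified with __device__.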

Any input will be great! Thanks in advance.

Kai

I very much doubt your approach will work, for any number of reasons including:

1. CUDA only supports a subset of C++, and classes with anything other than trivial constructors are unlikely to work without modification or redesign.

2. Unless you are using the latest Fermi GPU, recursion and dynamic memory allocation are not permitted in device code (and even then there are restrictions on what is feasible).

3. GPU code complexity is usually limited by register pressure, and trying to port a long, complex object-oriented code path directly into a kernel usually fails.

4. The GPU is really a 32-wide SIMD architecture, and anything more than minor divergence between "threads" within a SIMD group causes serialization and performance degradation.

5. Even if serialization isn't an issue in your code, 1,000 threads is a tiny workload in absolute terms for a GPU, and performance will be poor because there is insufficient parallelism to hide all the latencies in the GPU architecture.

6. The GPU has a number of hardware features, such as high-speed on-chip shared memory, hardware-assisted texture fetch and filtering, instruction-level parallelism, and optimized vector memory access, to name a few, and naïvely ported serial C code cannot leverage them, so per-thread performance will probably be poor.

To me, it sounds like you are approaching this in completely the wrong fashion.

The response from avidday lays out some good criteria for evaluating your existing code. One possible approach is to use them as a guide for identifying which parts of the code could benefit from CUDA. For the parts that cannot benefit from CUDA, you should consider using OpenMP. I wrote the Kappa library for exactly this reason; it does two things that you will find useful:

  1. Lets you mix in CUDA for the parts that can benefit from it and use existing or OpenMP code for the parts that run best on the CPU.
  2. Lets you take code (CUDA, OpenMP, serial, whatever) and run it in parallel. Kappa provides parallel data flow execution that can be specified (dynamically) using an index notation. The index notation lets you use an index to specify the execution threads with an indexed parameter set.

If I were you I would start with using option 2 with Kappa and then start optimizing particular subroutines and classes with CUDA and OpenMP.

I realized that while I may have answered the question in the body of your post, I did not directly address the question in your title, "kernel calling external C/C++ functions".

I do not know of any method that lets you explicitly call CPU functions from GPU kernels. However, I specifically designed the Kappa library framework to come as close to that as possible, implicitly. What I mean is that you can write GPU kernels that produce data (sets), which the Kappa framework's scheduling language uses both to control the scheduling and invocation of CPU or GPU kernels/functions and as data (set) parameters to those kernels/functions. In a sense, the fundamental purpose of Kappa Values and of the index notation is to work with this controlling data, potentially produced by GPU or CPU calculations, to dynamically set up and change the flow of program execution.

(For even more fun, GPU or CPU kernels could generate C++ CUDA/OpenMP source code, which a CPU kernel would write to disk. Since Kappa has full support for JIT-compiling that code, it could then be loaded and executed as part of the same process. Even simpler is to have GPU or CPU kernels emit Kappa scheduling instructions, which are then executed by the main program.)

If this is not clear enough about how to do this, then please contact me (contact information is available on the psilambda.com website).
