How to port a rather complicated C++ code (Geant4) to the GPU?

Hi, I’m pretty new to CUDA programming and I’m having problems trying to port part of the Geant4 code to the GPU.

Geant4 is a particle simulation toolkit written in C++.

The main reasons we think it is difficult are as follows:

  1. The Geant4 simulation is written in C++ rather than C.

  2. Geant4’s program structure is a multi-level class hierarchy (in other words, classes call other
    classes’ methods to complete the work).

  3. In order to let kernel functions use the class methods, we need to add the __device__ __host__ qualifiers before each function. Could anyone give some smart ideas for adding these qualifiers in front of every class’s member functions?

  4. Each Geant4 class uses many class pointer members. Hence, we need to use CUDA Unified Memory
    to handle them (this is a rather difficult part).

And we have used a simple C++ program to test this:

#include <iostream>
#include <cuda_runtime.h>
using namespace std;

// Level-2 class
class Rectangle {
public:
    __host__ __device__ int getArea() { return (*width * *height); }
    int* width;
    int* height;
};

// Level-1 class
class Shape {
public:
    Rectangle* rect;
};

__global__ void change_width(Shape* sha) {
    *(sha->rect->width) = 210;
    *(sha->rect->height) = 10;
}

int main(void) {
    // Every object and pointed-to field must live in managed memory
    // so that both host and device can dereference the pointers.
    Shape* sha;
    cudaMallocManaged(&sha, sizeof(Shape));
    cudaMallocManaged(&sha->rect, sizeof(Rectangle));
    cudaMallocManaged(&sha->rect->width, sizeof(int));
    cudaMallocManaged(&sha->rect->height, sizeof(int));
    *(sha->rect->width) = 20;
    *(sha->rect->height) = 10;

    change_width<<<1, 1>>>(sha);
    cudaDeviceSynchronize();

    // Print the area of the object.
    cout << "Total area: " << sha->rect->getArea() << endl;
    return 0;
}
With this code we can port a two-level class structure to the GPU, and it works.
But as you know, Geant4 is a rather big project, and I am afraid our simple approach will not scale to it.

Could anyone with experience in porting C++ code to the GPU give me some advice?
I really have no idea how to handle such a big program.
Thanks for all your help.


I think you should first determine which parts of the large code base will benefit from execution on a GPU and which parts won’t. Profile the CPU code to see which loops and functions are the critical ones, and focus on those.

However, it is often easier to design a project for the GPU from the ground up than to retroactively fit an existing CPU-based solution onto a CUDA GPU.

Probably a 1:1 mapping of Geant4’s C++ classes and objects into CUDA is not the best way forward. CUDA benefits greatly from “Struct of Arrays” (SoA) data storage, as opposed to arrays or lists of structs (or objects). SoA storage allows coalesced access to main memory, and therefore memory access speeds near the full hardware capability. Also consider using textures and constant memory for read-only data; this makes use of the GPU’s texture caches for localized reads.

What generally does not work well with CUDA is trying to squeeze a gazillion classes and all their code into one mega-kernel. You will quickly run out of registers, and as a result performance suffers so badly that you lose all benefit of executing on the GPU. You will most likely have to split the computation into several smaller kernels.


I think Christian is correct regarding first determining which pieces of the code will benefit from being run on the GPU.

As for the implementation, you should look at the C++ example in the CUDA samples. It demonstrates how to call kernels from C++ code. The sample is called “CppIntegration” and is inside the “0_Simple” folder of the CUDA samples.