Same C++/CUDA code base to work on demand on CPU or GPU?

I’m working on a raytracer.

I would like the user to be able to choose from a UI whether the rendering is processed on the CPU (if they don’t have any Nvidia video card, for example) or on the GPU (if they have an Nvidia video card that supports CUDA).

I would like to have a single C++ exe and, depending on a parameter passed to the exe (mode=cpu or mode=gpu), it would render on the CPU or the GPU.

First question:
If no Nvidia adapter is found, can the C++ CUDA kernels and the methods flagged __host__ __device__ still be executed?

Second question:
Is it possible to have a single C++ CUDA code base that can work on demand on the CPU or on the GPU by using __host__ __device__ methods everywhere?

Thanks a lot for your help

Eviral

Yes, a __host__ __device__ function can be called in an environment that has no GPU: the host-compiled version of such a function runs on the CPU like any ordinary C++ function.

However, you’ll need to address a number of other items to make it work, and, as discussed next, that by itself is not a complete path forward for CPU vs. CPU+GPU execution. These statements don’t apply to CUDA kernels.

That, by itself, is insufficient. At the CUDA C++ level, you do not dispatch work to a GPU using a __host__ __device__ method. The methodology for describing code that will run on a GPU (called a kernel in CUDA C++) uses the __global__ keyword, and a function marked that way cannot be run on a CPU.

There are a variety of questions (here is one) on various forums describing the top-level considerations for CPU/GPU alternate code paths. An entirely unified code path using purely CUDA C++ is not possible: you will need at least some degree of separation between the pure-CPU code path and the CPU+GPU code path.

There do exist technologies, such as C++ stdpar and OpenACC (and others, such as numpy/cupy, thrust, etc.), that allow for a more nearly unified code path between the CPU path and the CPU+GPU path.
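For instance, here is a minimal (untested) sketch of the stdpar approach: the same standard C++ source runs multithreaded on the CPU with an ordinary compiler, or offloads its parallel algorithms to the GPU when built with nvc++ -stdpar=gpu from the NVIDIA HPC SDK.

// one source file, two targets:
//   g++ -std=c++17 test.cpp -ltbb    -> parallel on the CPU (libstdc++ may need TBB)
//   nvc++ -stdpar=gpu test.cpp       -> same algorithms offloaded to the GPU
#include <algorithm>
#include <execution>
#include <vector>

int main(){
  std::vector<int> in(32768, 1), out(32768);
  // parallel element-wise op; the execution policy hides where it runs
  std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                 [](int v){ return v + 2; });
}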


Here’s a rough example of what you could do in CUDA C++ (not compiled or tested):

#include <vector>

// element-wise operation, usable from both host and device code
template <typename T>
__host__ __device__ void op(T *in, T *out){ *out = *in+2;}

// GPU path: grid-stride loop kernel applying op
template <typename T>
__global__ void gpu_dispatch(T *in, T *out, size_t n){

    for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=blockDim.x*gridDim.x)
      op(in+i, out+i);
}

// CPU path: OpenMP-parallel loop applying the same op
template <typename T>
void cpu_dispatch(T *in, T *out, size_t n){

#pragma omp parallel for
  for (size_t i = 0; i < n; i++)
     op(in+i, out+i);
}

using mt = int;
int main(){

  const int ds = 32768;
  std::vector<mt> data(ds, 1);
  // if no usable CUDA device/driver is present, this fails and we fall back to the CPU path
  bool use_gpu = (cudaSuccess == cudaSetDevice(0));
  if (use_gpu){
    mt *in, *out;
    cudaMalloc(&in,  sizeof(mt)*ds);
    cudaMalloc(&out, sizeof(mt)*ds);
    cudaMemcpy(in, data.data(), sizeof(mt)*ds, cudaMemcpyHostToDevice);
    gpu_dispatch<<<128,128>>>(in, out, ds);
    cudaMemcpy(data.data(), out, sizeof(mt)*ds, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    }
  else {
    cpu_dispatch(data.data(), data.data(), ds);
  }
}

You would compile that with nvcc (e.g. nvcc test.cu -o test -Xcompiler -fopenmp -lgomp), which statically links to libcudart by default. The executable built that way should be able to run on a machine with or without a GPU.

Besides the technical question of having a single code base run on both platforms: would your raytracer run fast enough on the CPU?

Would you need special CPU optimizations not done by the compiler?

Yes, my CPU implementation is not fast enough for me, even when rendering with multithreading on the CPU.
I have tried to optimize my code as much as possible; it’s far better than at the beginning, but still not fast enough for me.
That’s why I’m porting my raytracer (based on the book Ray Tracing in One Weekend) to CUDA, for faster rendering on good RTX video cards.

The time benefit is really good (about 10x faster compared to the CPU).

Sharing as many classes as possible between CPU and GPU is very important; maintaining two different code bases is just hell.

Thanks for your advice :)

The “Ray Tracing in One Weekend” book has a corresponding GPU version available for study.

And it mentions how one could share code between the GPU and the CPU in the “Adding Vectors” chapter.
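The trick shown there is roughly the following (abbreviated from memory, untested): the utility class’s member functions are marked __host__ __device__, so the same vec3 compiles for both sides.

// shared utility class: every method works in host and device code
struct vec3 {
  float e[3];
  __host__ __device__ vec3(float x, float y, float z) : e{x, y, z} {}
  __host__ __device__ float x() const { return e[0]; }
  __host__ __device__ vec3 operator+(const vec3 &v) const {
    return vec3(e[0]+v.e[0], e[1]+v.e[1], e[2]+v.e[2]);
  }
};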

Yes, I know! :)

Yes, that’s what I’m trying to do; I try to share as many utility classes as possible between CPU and GPU.

I started with GitHub - Belval/raytracing: Using CUDA to implement "Raytracing in one weekend" by Peter Shirley as a simple working base, and I added all my new features (bump, normal, and displacement textures; Phong, Oren-Nayar, and anisotropic materials; a fast BVH; OBJ import rendered as triangles…).

My big problem at the moment with sharing classes between GPU and CPU is that I had to pass a curandState* local_rand_state to many, many methods to be able to generate randomness.
I’m trying to build a wrapper above cuRAND so that I can pass my wrapper instance instead of a curandState, with one random number generation implementation for the GPU and another one for the CPU.
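In case it helps, here is a hypothetical sketch (names and host generator are my own, untested) of what such a wrapper could look like: the struct layout is identical on both sides, and the host/device split is confined to the function bodies via __CUDA_ARCH__.

#include <curand_kernel.h>

struct RandomGen {
  curandState dev_state;         // used only in device code
  unsigned long long host_state; // used only in host code

  __host__ __device__ void init(unsigned long long seed, unsigned long long seq){
#ifdef __CUDA_ARCH__
    curand_init(seed, seq, 0, &dev_state);
#else
    host_state = (seed ^ (seq + 0x9E3779B97F4A7C15ull)) | 1ull; // nonzero state
#endif
  }

  // uniform float in (0,1], matching curand_uniform's range
  __host__ __device__ float uniform(){
#ifdef __CUDA_ARCH__
    return curand_uniform(&dev_state);
#else
    // xorshift64 on the host; fine for a sketch, not for production quality
    host_state ^= host_state << 13;
    host_state ^= host_state >> 7;
    host_state ^= host_state << 17;
    return ((host_state >> 40) + 1) * (1.0f / 16777217.0f);
#endif
  }
};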

If you don’t need all the variety/specifics of cuRAND, you could use e.g. a thrust random generator (and its documentation). Only a small number of changes (container type, execution policy) would be needed to switch from host generation to device generation.
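Here is a rough, untested sketch of that idea: thrust’s default_random_engine is usable in both __host__ and __device__ code, so the same functor can fill either a host or a device container, and only the container type selects where the generation runs.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// one generator functor, compiled for host and device
struct uniform01 {
  __host__ __device__ float operator()(unsigned int i) const {
    thrust::default_random_engine rng(1234u);
    rng.discard(i); // decorrelate by element index
    thrust::uniform_real_distribution<float> dist(0.f, 1.f);
    return dist(rng);
  }
};

int main(){
  const int n = 8;
  thrust::counting_iterator<unsigned int> first(0);
  thrust::host_vector<float>   h(n);
  thrust::device_vector<float> d(n);
  thrust::transform(first, first+n, h.begin(), uniform01()); // generated on the CPU
  thrust::transform(first, first+n, d.begin(), uniform01()); // generated on the GPU
}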