I would like the user to be able to choose from a UI whether rendering is processed on the CPU (if he doesn’t have any NVIDIA video card, for example) or on the GPU (if he has an NVIDIA video card that supports CUDA).
I would like to have a single C++ exe and, depending on a param passed to the exe (mode=cpu or mode=gpu), it would render on the CPU or the GPU.
First question:
If no NVIDIA adapter is found, will the C++ CUDA kernels and the methods flagged __host__ __device__ still be able to execute?
Second question:
Is it possible to have a single C++ CUDA code base that can work on demand on the CPU or on the GPU by using __host__ __device__ methods everywhere?
Yes, a __host__ __device__ function can be called in an environment that has no GPU: the host-compiled version of the function simply runs on the CPU. However, you’ll need to address a number of other items to make that work, and as discussed next, that by itself is not a complete path forward for CPU vs. CPU+GPU execution. None of this applies to CUDA kernels.
Marking everything __host__ __device__ is, by itself, insufficient. At the CUDA C++ level, you do not dispatch work to a GPU using a __host__ __device__ method. Code that will run on a GPU (called a kernel in CUDA C++) is marked with the __global__ keyword, and a function marked that way cannot be run on a CPU.
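To make that distinction concrete, here is a minimal sketch (illustrative names, not tested): the __host__ __device__ function compiles for both processors, while the __global__ kernel can only ever be launched on a GPU.

#include <cstdio>

// Compiled twice: once for the host, once for the device. The host
// version runs fine on a machine with no GPU at all.
__host__ __device__ int add_two(int x) { return x + 2; }

// A kernel describes GPU work. It can only be launched (with <<<...>>>)
// on a CUDA device; there is no CPU fallback for __global__ code.
__global__ void add_two_kernel(int *out) { *out = add_two(40); }

int main(){
    printf("%d\n", add_two(40)); // host path: works everywhere, prints 42
}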
There are a variety of questions (here is one) on various forums describing the top-level considerations for CPU/GPU alternate code paths. An entirely unified code path using purely CUDA C++ is not possible; you will need at least some separation between the pure-CPU path and the CPU+GPU path.
There do exist technologies, such as C++ stdpar and OpenACC (and others, such as numpy/cupy, thrust, etc.), that allow for a more nearly unified code path between the CPU path and the CPU+GPU path.
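For example, with C++ stdpar the loop is written once, and which processor it runs on is decided by compiler flags rather than by your code. A sketch, assuming the NVIDIA HPC SDK’s nvc++ (not compiled or tested):

#include <algorithm>
#include <execution>
#include <vector>

int main(){
    std::vector<int> v(32768, 1);
    // Built with nvc++ -stdpar=gpu, this parallel algorithm is offloaded
    // to the GPU; with -stdpar=multicore (or an ordinary host compiler)
    // it runs on CPU threads. The source is identical either way.
    std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                   [](int x){ return x + 2; });
}

Note that with this approach the CPU/GPU choice is baked in at compile time, so the runtime mode=cpu/mode=gpu switch you described is more naturally handled with the CUDA C++ approach below.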
Here’s a rough example of what you could do in CUDA C++ (not compiled or tested):
#include <cuda_runtime.h>
#include <vector>

// The shared operation: compiled for both host and device, so both
// dispatch paths below can call it.
template <typename T>
__host__ __device__ void op(T *in, T *out){ *out = *in + 2; }

// GPU path: a grid-stride loop that applies op() across the data.
template <typename T>
__global__ void gpu_dispatch(T *in, T *out, size_t n){
    for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += blockDim.x*gridDim.x)
        op(in+i, out+i);
}

// CPU path: an OpenMP-parallelized loop over the same op().
template <typename T>
void cpu_dispatch(T *in, T *out, size_t n){
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        op(in+i, out+i);
}

using mt = int;

int main(){
    const int ds = 32768;
    std::vector<mt> data(ds, 1);
    // Runtime device selection: cudaSetDevice fails if no usable GPU is present.
    bool use_gpu = (cudaSuccess == cudaSetDevice(0));
    if (use_gpu){
        mt *in, *out;
        cudaMalloc(&in,  sizeof(mt)*ds);
        cudaMalloc(&out, sizeof(mt)*ds);
        cudaMemcpy(in, data.data(), sizeof(mt)*ds, cudaMemcpyHostToDevice);
        gpu_dispatch<<<128,128>>>(in, out, ds);
        cudaMemcpy(data.data(), out, sizeof(mt)*ds, cudaMemcpyDeviceToHost);
        cudaFree(in);
        cudaFree(out);
    }
    else {
        cpu_dispatch(data.data(), data.data(), ds);  // in-place on the host
    }
}
You would compile that with nvcc (e.g. nvcc test.cu -o test -Xcompiler -fopenmp -lgomp), which statically links to libcudart by default. The executable built that way should be able to run on a machine with or without a GPU.
Yes, my CPU implementation is not fast enough for me, even when rendering with multithreading on the CPU.
I have tried to optimize my code as much as possible; it’s far better than at the beginning, but still not fast enough for me.
That’s why I’m porting my raytracer (based on the book Ray Tracing in One Weekend) to CUDA for faster rendering on good RTX video cards.
The time benefit is really good (about 10x faster compared to the CPU).
Sharing as many classes as possible between CPU and GPU is very important; maintaining two different code bases is just hell.
My big problem at the moment with sharing classes between GPU and CPU is that I had to pass a curandState* local_rand_state to many, many methods in order to generate randomness.
I’m trying to build a wrapper over curand so that I can pass my wrapper instance instead of a curandState, with one random number generation implementation for the GPU and another for the CPU.
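The shape I have in mind is something like this (just a sketch, not tested): a single type whose __host__ __device__ method branches on __CUDA_ARCH__, so the same call site compiles to curand on the device and to an ordinary host RNG on the CPU.

#include <curand_kernel.h>
#include <cstdlib>

// Illustrative wrapper: pass one Rng everywhere instead of curandState*.
struct Rng {
    curandState *state;  // device side only; must be curand_init()-ed in a kernel first
    __host__ __device__ float uniform(){
#if defined(__CUDA_ARCH__)
        return curand_uniform(state);        // device compilation pass
#else
        return rand() / (float)RAND_MAX;     // host pass: any host RNG works here
#endif
    }
};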
If you don’t need all the variety/specifics of curand, you could use e.g. a thrust random generator (see the thrust documentation). Only a small number of changes (container type, execution policy) would be needed to switch from host generation to device generation.
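A rough sketch of that thrust approach (not compiled or tested; the per-index discard is one common way to decorrelate successive values):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/random.h>

// A functor usable on host and device: a small per-call engine, advanced
// by the element index so each element gets a different value.
struct rnd {
    __host__ __device__ float operator()(unsigned int i) const {
        thrust::default_random_engine eng(1234);
        thrust::uniform_real_distribution<float> dist(0.f, 1.f);
        eng.discard(i);
        return dist(eng);
    }
};

int main(){
    const int n = 32768;
    // Device generation:
    thrust::device_vector<float> d(n);
    thrust::transform(thrust::counting_iterator<unsigned int>(0),
                      thrust::counting_iterator<unsigned int>(n),
                      d.begin(), rnd());
    // Host generation: only the container type changes.
    thrust::host_vector<float> h(n);
    thrust::transform(thrust::counting_iterator<unsigned int>(0),
                      thrust::counting_iterator<unsigned int>(n),
                      h.begin(), rnd());
}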