Automatic Acceleration if GPU is Available?

Hey guys, I’m currently working on enhancing an existing library my team has put together in C++ called GSAP (Generic Software Architecture for Prognostics).

I won’t get too much into the details of the library itself, but most of our prognostic problems have a model associated with them that we use in Monte Carlo simulation. In an attempt to produce more accurate results and also speed up the GSAP models, I’m looking to use CUDA to run the particles from the simulation in parallel. As of now I’ve had success porting one of our models over to CUDA, with fantastic performance results.

The problem here lies in the need to create separate modules for each prognostic model. This effectively doubles the work each time one of the models is updated or a new one is created, as a model will now have both GPU and CPU versions. My group has collectively decided we’re better off trying to find a way around this, ideally keeping our library to a single module per model and somehow combining the GPU and CPU models.

I’ve done a bit of looking around on this topic and the only thing I could find that was really related was this article here:

I was able to extend this a bit and use a bash script to determine whether the machine on which compilation takes place has a CUDA-enabled device, pointing to different makefile targets, where the GPU target uses compiler directives to select the proper model version. Unfortunately, while this solves the problem of dynamically deciding which model to use (GPU if possible, otherwise default to CPU), it does not solve the issue of requiring multiple modules for each model version.
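For context, a minimal sketch of the kind of detection script described above; the make targets "gpu" and "cpu" are hypothetical placeholders, not GSAP's actual makefile targets:

```shell
#!/bin/sh
# Pick a build target based on whether this machine has a usable CUDA setup:
# the nvcc compiler must be on PATH and nvidia-smi must list at least one GPU.
if command -v nvcc >/dev/null 2>&1 \
   && command -v nvidia-smi >/dev/null 2>&1 \
   && nvidia-smi -L 2>/dev/null | grep -q GPU; then
    target=gpu
else
    target=cpu
fi
echo "$target"    # a real script would then run e.g.: make "$target"
```

On a machine without the CUDA toolchain the first `command -v` check fails cheaply, so the script falls back to the CPU target without error.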

My question to you guys is: is this an existing issue that others have had to work around, and if so, what is the recommended approach here? Hopefully what I’m looking to do has been described well enough above, but if you’d like me to expand at all, please let me know.


In a project of mine, I’ve come up with the following.
I put the logic of the simulation in a __host__ __device__ function.
Then, inside this function, I use the __CUDA_ARCH__ macro to choose between a normal for-loop in the CPU version and a grid-strided loop in the device version. Finally, I can either call this function directly (host version) or from inside a kernel (device version).

Something like this:

#ifdef __CUDACC__
    #define HOSTQUALIFIER __host__
    #define DEVICEQUALIFIER __device__
    #define HOSTDEVICEQUALIFIER __host__ __device__
    #define KERNEL __global__
#else
    // plain C++ compiler: the qualifiers expand to nothing
    #define HOSTQUALIFIER
    #define DEVICEQUALIFIER
    #define HOSTDEVICEQUALIFIER
    #define KERNEL
#endif

HOSTDEVICEQUALIFIER
void calculate(...parameters...){
#ifdef __CUDA_ARCH__
	// on the device, we use the global thread Id in a grid-strided loop to index the data
	for(int index = blockIdx.y * blockDim.y + threadIdx.y; index < N; index += blockDim.y * gridDim.y){
#else
	// on the host, we use OpenMP to parallelize the loop
	#pragma omp parallel for
	for(int index = 0; index < N; index++){
#endif
		// do calculation
	}
}

KERNEL
void calculateKernel(...parameters...){
	calculate(...parameters...);
}

void calculate_cpu(...parameters...){
	calculate(...parameters...);
}

#ifdef __CUDACC__
void calculate_gpu(...parameters...){
	calculateKernel<<<grid, block, smem, stream>>>(...parameters...);
}
#endif

That’s exactly what I was looking for, I had no idea you could actually use both attributes to get a function into both the GPU and CPU compilation pipelines. Thank you very much!

As a short follow-up question, does this allow you to put everything in a .cpp file instead of a .cu file and still have it compiled into a GPU build with NVCC? To my understanding, the source gets split into host and device compilation pipelines, so it won’t cause problems on a system without a device, as that pipeline never gets created, but it will compile the GPU version just fine on CUDA-enabled systems, even without the extension being .cu?

Update: Found what I was looking for, the -x cu flag can be used to make nvcc treat .cpp files as if they were .cu! Awesome!
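A hedged sketch of how that could look in a build step (model.cpp is a hypothetical file name; the script only echoes the command it would run, depending on which toolchain is available):

```shell
#!/bin/sh
# Compile the same model.cpp either as CUDA source (via nvcc -x cu) or as
# plain C++ with OpenMP. The commands are echoed rather than executed here.
if command -v nvcc >/dev/null 2>&1; then
    # -x cu tells nvcc to treat the file as CUDA source despite the .cpp extension
    cmd="nvcc -x cu -Xcompiler -fopenmp -o model model.cpp"
else
    cmd="g++ -fopenmp -o model model.cpp"
fi
echo "$cmd"
```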

OpenACC sounds like a good fit if your goal is to avoid code duplication.
It is available in, e.g., the GNU GCC compiler or the PGI compiler.