Best practices for CPU vs GPU programming in C?

So, I’ve been able to think of a number of naïve ways to accomplish this, but does anyone know of any resources/documents that apply to CPU vs GPU programming in C?

By this, I mean let’s say my program has a matmul function; however, I don’t know, nor can I presume, that the person using it has a CUDA-enabled device.

I know how to perform the needed detection as the program starts, and if a GPU is found I could set some necessary flags, then in the matmul function have an if/else or switch on how the operation is performed (i.e. I want to avoid having separate matmulCPU and matmulGPU functions, and of course the detection part would only be done once at the start).
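The one-time-detection-plus-flag approach described above can be sketched roughly as follows. This is a minimal illustration, not an official pattern; the `USE_CUDA` guard and the `matmul_gpu_impl` placeholder are assumptions, added so the same file also builds and runs on machines without the CUDA toolkit:

```c
#include <stdio.h>
#include <stdbool.h>

#ifdef USE_CUDA
#include <cuda_runtime.h>
#endif

/* Set once at program start; read-only afterwards. */
static bool g_have_gpu = false;

void detect_gpu(void)
{
#ifdef USE_CUDA
    int count = 0;
    /* cudaGetDeviceCount fails cleanly when no driver/device is present. */
    if (cudaGetDeviceCount(&count) == cudaSuccess && count > 0)
        g_have_gpu = true;
#endif
}

/* Single public entry point; the CPU/GPU decision is internal. */
void matmul(const float *a, const float *b, float *c, int n)
{
    if (g_have_gpu) {
        /* matmul_gpu_impl(a, b, c, n);  hypothetical: copy to device,
           launch kernel, copy result back */
    } else {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float s = 0.0f;
                for (int k = 0; k < n; ++k)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }
}
```

Built without `-DUSE_CUDA` this degenerates to the plain CPU path, which is exactly the behavior wanted when the user has no CUDA device.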

I just didn’t know whether there was a more elegant/preferred way of handling this type of situation?

This may be of interest, including the links. I’m not sure exactly what constitutes “elegant/preferred”, but the stdpar and OpenACC methods have various benefits: concise expression, portability, CPU or GPU operation, mostly unified code paths, etc.

@Robert_Crovella thanks, I will have a look.

By more ‘elegant/preferred’, I just meant that the easiest way would be to create global variables to hold the GPU configuration (and whether a GPU is present or absent); but these days using global vars is mostly seen as a bit ‘verboten’.

At the same time, passing configuration pointers to many disparate functions that will not use them could start to seem tedious;

That is just why I wondered whether there was some other way, or what is seen as best practice.
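A common middle ground between globals and threading the configuration through every call is a context struct created once at startup and passed as the first argument to the compute routines. A minimal sketch, where `compute_ctx`, `ctx_init`, and `scale` are hypothetical names:

```c
#include <stdbool.h>

/* Hypothetical context: the GPU configuration lives in one object that is
   created once at startup and passed explicitly, instead of in globals. */
typedef struct {
    bool have_gpu;   /* result of the one-time detection */
    int  device_id;  /* -1 when no CUDA device was found */
} compute_ctx;

compute_ctx ctx_init(void)
{
    compute_ctx ctx = { false, -1 };
    /* ...run the CUDA detection here and fill in the fields... */
    return ctx;
}

/* Compute routines take the context as their first argument and branch on
   it internally, so callers still see a single entry point per operation. */
void scale(const compute_ctx *ctx, float *data, int n)
{
    if (ctx->have_gpu) {
        /* GPU path */
    } else {
        for (int i = 0; i < n; ++i)
            data[i] *= 2.0f;
    }
}
```

Only the functions that actually dispatch need the context; pure helpers stay untouched, which limits the tedium of passing it everywhere.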

Please do not forget that for many kernels copying from host (CPU) memory to device (GPU) memory and back is the limiting factor, if done for each operation.

So think about a way to use classes (if C++) or e.g. function pointers in structs (C) to abstract the location of your data.
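The function-pointers-in-structs idea can be sketched as a small vtable bound at creation time, so callers never branch on where the data lives. All names here (`buffer`, `buffer_ops`, `buffer_create_cpu`) are hypothetical, and only the CPU implementation is filled in; a GPU variant would populate the same table with `cudaMemcpy`/`cudaFree` wrappers:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct buffer buffer;

/* Operation table: one instance per backend (CPU here, GPU elsewhere). */
typedef struct {
    void (*upload)(buffer *dst, const float *src, size_t n);
    void (*download)(const buffer *src, float *dst, size_t n);
    void (*release)(buffer *buf);
} buffer_ops;

struct buffer {
    const buffer_ops *ops;  /* bound once at creation */
    float *data;            /* host pointer or device pointer */
    size_t n;
};

/* --- CPU implementation --- */
static void cpu_upload(buffer *dst, const float *src, size_t n)
{
    memcpy(dst->data, src, n * sizeof(float));
}
static void cpu_download(const buffer *src, float *dst, size_t n)
{
    memcpy(dst, src->data, n * sizeof(float));
}
static void cpu_release(buffer *buf) { free(buf->data); }

static const buffer_ops cpu_ops = { cpu_upload, cpu_download, cpu_release };

buffer buffer_create_cpu(size_t n)
{
    /* Error handling (NULL from malloc) omitted for brevity. */
    buffer b = { &cpu_ops, malloc(n * sizeof(float)), n };
    return b;
}
```

Because the kernels only ever see a `buffer`, data can stay resident on the device across many operations, avoiding the per-operation host/device copies mentioned above.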

That is perhaps even more critical for seamless operation than how the kernels are called. If you forget one kernel call, then your program may merely be slow when the data normally resides on the CPU; but if you move the data to the GPU, you could get undefined behavior when accessing it.

Managed memory is a possible solution, but it is slow compared to more direct implementations.
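For the managed-memory route, an allocation wrapper can hide the choice. A hedged sketch: the `USE_CUDA` guard and the `shared_alloc`/`shared_free` names are assumptions, with plain `malloc` as the CPU-only fallback; `cudaMallocManaged` returns memory that both host and device can access, with pages migrated on demand (hence the convenience, and the performance cost relative to explicit copies):

```c
#include <stdlib.h>

#ifdef USE_CUDA
#include <cuda_runtime.h>
#endif

/* Allocate memory reachable from both CPU and GPU where available;
   plain malloc otherwise, so a CPU-only build still works. */
float *shared_alloc(size_t n)
{
#ifdef USE_CUDA
    float *p = NULL;
    if (cudaMallocManaged((void **)&p, n * sizeof(float),
                          cudaMemAttachGlobal) == cudaSuccess)
        return p;   /* driver migrates pages on demand */
    return NULL;
#else
    return malloc(n * sizeof(float));
#endif
}

void shared_free(float *p)
{
#ifdef USE_CUDA
    cudaFree(p);
#else
    free(p);
#endif
}
```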

@Curefab thank you for your perspective and ‘pointers’.

In this case it concerns a small community project of which I am not the lead, and C has been selected as the language (honestly I’d much rather have a ‘real’ full OOP environment), so I was just trying to work out/through my options.

Best,
-A

There are some C projects with better abstraction than the average C++ program. But achieving it in C is far less nice.

I would distinguish between the abstraction of the operations (device functions running on the GPU) and the abstraction of calling those functions (preparatory work on the CPU and perhaps the global function on the GPU itself).

For the device functions I would try to share as much code 1:1 as possible.
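Sharing device-function code 1:1 is commonly done by marking the function for both compilers. A small sketch, assuming the `HD` macro name; under `nvcc` the `__CUDACC__` branch adds the `__host__ __device__` qualifiers, while a plain C compiler sees an ordinary function:

```c
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* Written once; callable from CPU code and from inside GPU kernels. */
HD float dot3(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```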

For the dispatch code (preparatory work and global functions) I would try to make it as similar as possible between the functions that are ultimately called. Either use conditionals, macros, or code generation, or write it manually.
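As one illustration of the macro option, a single macro can stamp out a uniform dispatch wrapper for each operation, so every public entry point has an identical shape. The `DEFINE_DISPATCH` name and the `negate` backends are hypothetical, and the GPU backend here just falls through to the CPU path in lieu of a real kernel launch:

```c
#include <stdbool.h>

/* Generates "void name(bool on_gpu, float *, int)" dispatching to
   name##_cpu / name##_gpu, keeping all entry points structurally equal. */
#define DEFINE_DISPATCH(name)                        \
    void name(bool on_gpu, float *data, int n)       \
    {                                                \
        if (on_gpu)                                  \
            name##_gpu(data, n);                     \
        else                                         \
            name##_cpu(data, n);                     \
    }

/* Example backends for one operation. */
static void negate_cpu(float *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] = -data[i];
}
static void negate_gpu(float *data, int n)
{
    /* would launch a kernel; falls back to CPU in this sketch */
    negate_cpu(data, n);
}

DEFINE_DISPATCH(negate)
```

Each new operation then needs only its two backends plus one `DEFINE_DISPATCH(...)` line, which keeps the dispatch layer uniform without hand-writing every wrapper.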