What's the general opinion on launching kernels from class methods?

I have a codebase which starts “life” as a C++ codebase, and can be compiled as such for nearly the same functionality but on a single CPU thread. When CUDA is present, one can compile the more sophisticated form. Until recently, the C++ part of the code was way ahead, but this was necessary to lay out the general shape of things.

Now I’m starting to write functions to serve the GPU side of objects. Earlier, we covered the topic of including __global__ function prototypes in header files, and I decided that I will write a .cuh file containing those prototypes in addition to the .h file containing the prototypes for the extern functions that launch those kernels. In a more general sense, I’m now writing kernels to accomplish on the GPU what many of my housekeeping C++ functions accomplish on the host for various objects in my code.
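To make that header split concrete, here is a minimal single-file sketch of what I have in mind. The file names in the comments and the builtWithCuda() helper are hypothetical, made up for illustration. HAVE_CUDA is my own build flag, while __CUDACC__ is the macro nvcc defines when it compiles CUDA source.

```cpp
#include <cstddef>

// ---- would live in a hypothetical sum_kernels.cuh ----
// __CUDACC__ is defined by nvcc, so a host-only compiler skips the kernel prototype.
#ifdef __CUDACC__
__global__ void kSumStorage(const int* storage, std::size_t length, long long int* result);
#endif

// ---- would live in a hypothetical sum_launchers.h ----
// Plain C++ signature taking raw device pointers, so ordinary .cpp files can include it.
#ifdef HAVE_CUDA
void launchKSumStorage(const int* storage_devc, std::size_t length, long long int* result_devc);
#endif

// Hypothetical helper: reports which flavor of the build this translation unit belongs to.
constexpr bool builtWithCuda() {
#ifdef HAVE_CUDA
  return true;
#else
  return false;
#endif
}
```

Compiled without -DHAVE_CUDA, none of the CUDA-specific declarations are seen and builtWithCuda() returns false, which is the single-threaded C++ build described above.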

My question is whether I should include the functions that launch the corresponding __global__ kernels as additional member functions of the class, which are only compiled when CUDA is present, or if I should leave the launchers as free functions which the existing class functions call through some new branch that gets compiled when CUDA is present. I don’t know about making the actual kernels, the __global__ functions themselves, class members. But something like this is what I am thinking:

In a .cpp file:

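class MyClass {
public:
  MyClass(size_t num_ints);
  void sumStorage();
#ifdef HAVE_CUDA
  // Launcher is only declared when CUDA is present
  void launchKSumStorage();
#endif

private:
  Hybrid<long long int> accumulation;
  Hybrid<int> storage;
};

MyClass::MyClass(size_t num_ints) :
  accumulation{1, "sum_of_all_entries"},
  storage{num_ints, "list_of_integers"}
{}

void MyClass::sumStorage() {
#ifdef HAVE_CUDA
  // GPU build: hand the work to the kernel launcher
  launchKSumStorage();
#else
  // CPU build: sum on the host, single-threaded
  long long int st_sum = 0LL;
  for (size_t i = 0; i < storage.size(); i++) {
    st_sum += static_cast<long long int>(storage.readHost(i));
  }
  accumulation.putHost(st_sum, 0);
#endif
}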

Then, in a subsequent .cu file:

__global__ void kSumStorage(const int* storage, const size_t length, long long int* result) {
  __shared__ long long int warp_sums[32];
  long long int block_sum = 0LL;
  // Grid-stride loop: each thread starts at its global index so blocks do not overlap
  for (size_t pos = blockIdx.x * blockDim.x + threadIdx.x; pos < length; pos += blockDim.x * gridDim.x) {
    block_sum += (long long int)(storage[pos]);
  }
  // Reduce block_sum over warps...
  if ((threadIdx.x & 0x1f) == 0) {
    warp_sums[threadIdx.x >> 5] = block_sum;
  }
  // Reduce block_sum over the __shared__ entries...
  if (threadIdx.x == 0) {
    atomicAdd((unsigned long long int*)(result), (unsigned long long int)(block_sum));
  }
}

void MyClass::launchKSumStorage() {
  cudaMemset(accumulation.data(HybridTargetLevel::DEVICE), 0, sizeof(long long int));
  // num_blocks and threads_per_block are placeholders; launch dimensions are not settled here
  kSumStorage<<<num_blocks, threads_per_block>>>(storage.data(HybridTargetLevel::DEVICE), storage.size(),
                                                 accumulation.data(HybridTargetLevel::DEVICE));
}

In the above, I’m writing the simple kernel somewhat off-the-cuff, but the idea should be clear. Hybrid is an object I use, reminiscent of std::vector<T>, which holds data on both the host and device and offers the means to upload and download that data. Should I make the kernel-launching functions member functions or free functions? The former would keep the namespace tidier, but it seems to be coaxing me towards extern functions inside a class. I don’t know if this is possible, and it seems that the extern declaration is not strictly required either: I now realize that my code was compiling without such declarations when I thought that I had added them. And should I (can I even) have the __global__ kernel itself be a class member function?
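To put the two layouts side by side, here is a rough host-side sketch. HostOnlyHybrid and MyClassSketch are hypothetical stand-ins I made up for this post: the real Hybrid also manages device memory and takes a label in its constructor. The launchers are declarations only and would need definitions in a .cu file. Without HAVE_CUDA, only the CPU path is compiled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for Hybrid<T>; the real class also holds device data.
template <typename T> class HostOnlyHybrid {
public:
  explicit HostOnlyHybrid(std::size_t n) : host_data(n, T{}) {}
  std::size_t size() const { return host_data.size(); }
  T readHost(std::size_t i) const { return host_data[i]; }
  void putHost(T value, std::size_t i) { host_data[i] = value; }
private:
  std::vector<T> host_data;
};

#ifdef HAVE_CUDA
// Option B: a free launcher taking raw device pointers, defined in a .cu file.
void launchKSumStorage(const int* storage_devc, std::size_t length, long long int* result_devc);
#endif

class MyClassSketch {
public:
  explicit MyClassSketch(std::size_t num_ints) : accumulation(1), storage(num_ints) {}

  void setEntry(std::size_t i, int value) { storage.putHost(value, i); }
  long long int total() const { return accumulation.readHost(0); }

#ifdef HAVE_CUDA
  // Option A: a member launcher, declared here and defined in a .cu file.
  void launchKSumStorage();
#endif

  // The existing member function branches on the build type.
  void sumStorage() {
#ifdef HAVE_CUDA
    launchKSumStorage();  // Option A; Option B would call the free function with raw pointers
#else
    long long int st_sum = 0LL;
    for (std::size_t i = 0; i < storage.size(); i++) {
      st_sum += static_cast<long long int>(storage.readHost(i));
    }
    accumulation.putHost(st_sum, 0);
#endif
  }

private:
  HostOnlyHybrid<long long int> accumulation;
  HostOnlyHybrid<int> storage;
};
```

In the CPU-only build, filling four entries with 1, 2, 3, 4 and calling sumStorage() leaves 10 in the accumulator. Either way the caller only ever sees sumStorage(), so the choice between the two options mostly affects which headers need to know about device pointers.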

Thanks as always,