How do I compile __global__ kernels with a class?

I am working on a class that uses CUDA and I need a couple of custom kernels (i.e. __global__ void functions), but I can't figure out where to place these functions.

All of the examples I have seen online (e.g. http://devblogs.nvidia.com/parallelforall/separate-compilation-linking-cuda-device-code/) put these kernels in the main.cpp file, but I cannot do this with my project.

How can I include them alongside my class and compile it all into a .o file that I can then link into the main executable?

EDIT: to clarify a bit, my program looks like the following:
CUDA class (used in main program), compiled to .o with nvcc
main program, compiled with icc and linked against libraries

I need the CUDA class to be able to call kernels (i.e. __global__ void functions), but I don't know where to place them. I cannot place them in the main program.

I'm not sure the following code will help you, but here is one way to lay it out:

// ----- foo.h
class foo {
  unsigned int size_;
public:
  foo(unsigned int size) : size_(size) {}
  void add(int* c, const int* a, const int* b);
};
// ----- foo.cu
#include "foo.h"
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void addKernel(int *c, const int *a, const int *b);

void foo::add(int* c, const int* a, const int* b) {
  addKernel<<<1,size_>>>(c, a, b);
}
// ----- addKernel.cu
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void addKernel(int *c, const int *a, const int *b) {
  int i = threadIdx.x;
  c[i] = a[i] + b[i];
}
// ----- main.cpp
#include "foo.h"
int main() {
   const unsigned int N = 10;
   int* dev_a;
   int* dev_b;
   int* dev_c;
   ....
   foo aFoo(N);
   aFoo.add(dev_c, dev_a, dev_b);
   ...
}

If what you want is to launch the kernel from a .cpp file, then:

// ----- foo.cpp
...

void foo::add(int* c, const int* a, const int* b) {
  // the following is equivalent to:  addKernel<<<1,size_>>>(c, a, b);
  void* args[] = { &c, &a, &b };
  cudaLaunchKernel((const void*)&addKernel, dim3(1), dim3(size_), args, 0, 0);
}

Thanks for the posts. I don't know what happened to my posts (they were disappearing or hidden, but I think that's fixed now), but I did find what I needed in the CppIntegration example inside the CUDA package.

I accomplished it by doing essentially what you posted in the first reply. I added an extern "C" void wrapper function that calls the kernel, included its header in the class template, and then included the class template's header in the main program. It compiles and works fine. Thanks again.
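
For anyone landing here later, the wrapper approach described above might look roughly like the sketch below. It is not the code from the CppIntegration sample. The names launchAdd and add_wrapper.h/.cu are made up for illustration, the extra length parameter n and the 256-thread block size are my own assumptions, and everything is shown in one listing for brevity even though it would normally be split across the files marked in the comments. Keeping the kernel and its wrapper in the same .cu file means the host compiler never sees any CUDA syntax, and as far as I know it also avoids the need for relocatable device code.

```cuda
// Build sketch (file names and library path are assumptions):
//   nvcc -c add_wrapper.cu -o add_wrapper.o
//   icc  -c main.cpp -o main.o
//   icc  main.o add_wrapper.o -L/usr/local/cuda/lib64 -lcudart -o app

#include <cuda_runtime.h>

// ----- add_wrapper.h (hypothetical): plain declaration with no CUDA syntax,
// so a host compiler such as icc can include it.
extern "C" void launchAdd(int* dev_c, const int* dev_a, const int* dev_b,
                          unsigned int n);

// ----- add_wrapper.cu (hypothetical): kernel plus wrapper, compiled by nvcc.
__global__ void addKernel(int* c, const int* a, const int* b, unsigned int n) {
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];  // threads past the end do nothing
}

extern "C" void launchAdd(int* dev_c, const int* dev_a, const int* dev_b,
                          unsigned int n) {
  if (n == 0) return;                     // a zero-sized grid is not a valid launch
  const unsigned int threads = 256;       // assumed block size
  const unsigned int blocks = (n + threads - 1) / threads;  // round up
  addKernel<<<blocks, threads>>>(dev_c, dev_a, dev_b, n);
}

// ----- foo.h: the class only needs the wrapper's declaration.
class foo {
  unsigned int size_;
public:
  explicit foo(unsigned int size) : size_(size) {}
  // c, a and b must be device pointers holding at least size_ ints each.
  void add(int* c, const int* a, const int* b) { launchAdd(c, a, b, size_); }
};
```

The main program then includes only foo.h, allocates dev_a, dev_b and dev_c with cudaMalloc, and calls aFoo.add(dev_c, dev_a, dev_b) as in the earlier main.cpp. The launch is asynchronous, so the results should be read back only after a cudaMemcpy or cudaDeviceSynchronize.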