I was trying to implement a generic class in CUDA for common algorithms like reduce or scan, with some pre-processing such as a simple map applied inside the algorithm. The map operation is carried out before the actual reduce/scan step. To realize this I was hoping to use lambda functions. The following is how I tried to implement it.

```
template<typename T>
__device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    // Actual reduce algorithm goes here
}

template<typename T, typename LAMBDA>
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();
    ReduceOperationPerThread(d_in, d_out, size);
}
```

The helper function that invokes this kernel is created as follows:

```
template<typename T, typename LAMBDA>
void Reduce(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // Preparing block sizes, grid sizes,
    // and additional logic for invoking the kernel goes here,
    // with the kernel invocation as follows:
    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}
```

All of the above code is included in a source file named Reduce.cu, and the corresponding header, Reduce.h, is created as follows:

```
// Reduce.h
template<typename T, typename LAMBDA>
void Reduce(T * d_in, T * d_out, unsigned int size, LAMBDA lam);
```

So at the end of the day the complete Reduce.cu looks like this,

```
// Reduce.cu
template<typename T>
__device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    // Actual reduce algorithm goes here
}

template<typename T, typename LAMBDA>
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();
    ReduceOperationPerThread(d_in, d_out, size);
}

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // Preparing block sizes, grid sizes,
    // and additional logic for invoking the kernel goes here,
    // with the kernel invocation as follows:
    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}
```

But the problem I'm having is related to writing template functions in separate .h and .cu files.

In normal cases, where lambda functions are not used, what I used to do was to add explicit instantiations of the function for all the possible template parameter values at the end of the .cu file, as described at https://isocpp.org/wiki/faq/templates#separate-template-fn-defn-from-export-keyword under the FAQ entry "How can I avoid linker errors with my template classes?":

```
// At the end of the Reduce.cu file:
// explicit instantiations for the possible template parameter
// values, for a normal Reduce function
template void Reduce<double>(double * d_in, double * d_out, unsigned int size);
template void Reduce<float>(float * d_in, float * d_out, unsigned int size);
template void Reduce<int>(int * d_in, int * d_out, unsigned int size);
```

But in this case, the possible values of the template parameter LAMBDA cannot be predefined:

```
template void ReduceWPreprocessing<int>(int * d_in, int * d_out, unsigned int size, ??? lambda);
```

Is there another way to use lambda functions for this kind of application?

The answer I got on Stack Overflow was that there is no way to pre-instantiate a template for lambda arguments, since each lambda has its own unnameable type.