Using Lambda functions in CUDA with template functions

I was trying to implement a generic class using CUDA for common algorithms like reduce or scan, providing some pre-processing, such as a simple map, inside the algorithm. These map operations are carried out before the actual reduce/scan step. To realize this, I was hoping to use lambda functions. The following is how I tried to implement it.

template<typename T> __device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    //Actual Reduce Algorithm Comes here 
}

template<typename T, typename LAMBDA> 
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();

    ReduceOperationPerThread(d_in, d_out, size);
}

The helper function that invokes this kernel is created as follows,

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // preparing block sizes, grid sizes
    // and additional logic for invoking the kernel goes here
    // with the Kernel invocation as following

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}

All of the above code is included in a source file named Reduce.cu, and the respective header, Reduce.h, is created as follows,

// Reduce.h
template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam);

So at the end of the day the complete Reduce.cu looks like this,

// Reduce.cu
template<typename T> __device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    //Actual Reduce Algorithm Comes here 
}

template<typename T, typename LAMBDA> 
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();

    ReduceOperationPerThread(d_in, d_out, size);
}

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // preparing block sizes, grid sizes
    // and additional logic for invoking the kernel goes here
    // with the Kernel invocation as following

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}
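
A consumer would then call the helper with a lambda along these lines. This is only a sketch: it assumes nvcc's extended-lambda support (enabled with the --expt-extended-lambda flag), which allows the lambda to be annotated __device__ so it is callable from the kernel; the wrapper Example and the empty lambda body are made up for illustration.

// Hypothetical call site, compiled by nvcc with --expt-extended-lambda
void Example(float * d_in, float * d_out, unsigned int size)
{
    ReduceWPreprocessing(d_in, d_out, size,
        [] __device__ () {
            // per-thread pre-processing (the map step) would go here
        });
}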

But the problem I’m having is related to writing template functions in separate .h and .cu files.

In normal cases, where lambda functions are not used, what I used to do was add explicit instantiations of the function for all the possible template parameter values at the end of the .cu file, as described in the ISO C++ FAQ entry “How can I avoid linker errors with my template classes?” (https://isocpp.org/wiki/faq/templates#separate-template-fn-defn-from-decl-export-keyword).

// At the end of the Reduce.cu file
// Writing functions with possible template values 
// For A normal Reduce function

template void Reduce<double>(double * d_in, double * d_out, unsigned int size);
template void Reduce<float>(float * d_in, float * d_out, unsigned int size);
template void Reduce<int>(int * d_in, int * d_out, unsigned int size);

But in this case the possible values for the template parameter LAMBDA cannot be predefined, because every lambda expression has its own unique, unnamed closure type.

template void ReduceWPreprocessing<int>(int * d_in, int * d_out, unsigned int size, ??? lambda);
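
To make the difficulty concrete: every lambda expression has a distinct, compiler-generated closure type, so there is no type name that could be written in place of the ??? above. A small host-side illustration:

#include <type_traits>

auto lam1 = [](){};
auto lam2 = [](){};
// Even two textually identical lambdas have different types,
// so neither type can be spelled out in an explicit instantiation.
static_assert(!std::is_same<decltype(lam1), decltype(lam2)>::value,
              "each lambda has its own unique closure type");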

Is there another way to use lambda functions for this kind of application?

The answer I got on Stack Overflow was that there is no way to use lambda functions along with template parameters in this way.

I don’t think this question has anything to do with CUDA, does it?

How would you perform this in ordinary C++ ?

If you don’t know, maybe what you should do is remove the CUDA aspects of this question and post it on a C++ forum like SO, where there are approximately 5 billion C++ programmers looking at it.

Oh, here’s your problem right here. Yeah, you can’t do that in C++. In C++, a template is not compiled code: it gives the compiler enough information to generate a concrete function or type later, once the template arguments are known, but until that point the compiler can’t do much with it. There’s no way to put a fully generic definition in its own source file and link against it; only explicit instantiations and specializations can be put into their own source file.

This means all your templated code needs to be in the header file.
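
In other words, the usual C++ resolution is to move the definition itself into the header, roughly like this (a minimal sketch of what Reduce.h would contain, ignoring the CUDA-specific complication raised below):

// Reduce.h - the generic definition lives in the header, so it is
// visible to every translation unit that instantiates it
template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();
    // ... the rest of the reduction logic would go here ...
}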

To pass a lambda, be horribly insecure and just do this:

template <typename T, typename F>
auto map(T* array_of_data, F functor) -> T*;

int data[] = {0, 1, 2};
map<int>(data, [](int i) -> int { return i + 1; });

@txbob,

If it were a normal C++ problem I would have implemented everything in the header file. But in this case I cannot do that, since the header might be included in C++ source files that are compiled by the default host C++ compiler, which doesn’t recognize CUDA keywords.

You can use two header files, or a single file that checks __CUDACC__ (defined whenever nvcc is compiling the file; note that __CUDA_ARCH__ is defined only during nvcc’s device-code passes, so it is the wrong macro for hiding code from a plain host compiler):

#ifdef __CUDACC__
// nvcc compiles here
#else
// the plain host compiler compiles here
#endif
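
Putting that together, the guarded Reduce.h might look roughly like this. It is only a sketch under the assumptions above: __CUDACC__ hides the CUDA template code from plain host compilers, which also means host-only .cpp translation units never see the lambda-taking API (they could not launch the kernel anyway), while .cu files get the full header-only definitions and so avoid the explicit-instantiation problem entirely. The block/grid configuration is an example, not a recommendation.

// Reduce.h - sketch of a single header guarded for mixed compilation
#ifdef __CUDACC__

template<typename T>
__device__ void ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    // actual reduce algorithm goes here
}

template<typename T, typename LAMBDA>
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();
    ReduceOperationPerThread(d_in, d_out, size);
}

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    unsigned int blockSize = 256;                         // example configuration
    unsigned int gridSize = (size + blockSize - 1) / blockSize;
    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}

#endif // __CUDACC__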