Pass a function and its parameters to another function to execute later on the GPU using CUDA

I am trying to pass a device function and its parameters to a host function so that I can execute it later on the GPU using CUDA. The idea is to write different device functions and reuse the same host and global functions for all of them. For example:

__device__ void DeviceFunction(int* arr1, int* arr2)
{
    // Do something...
}

int main()
{
    // arr1 and arr2 are device pointers; allocation omitted for brevity
    HostFunction(DeviceFunction, arr1, arr2);
}

Let's assume the host function launches a __global__ function, and the global function in turn calls the device function:

template<typename Tf, typename... T>
__global__ void GlobalFunction(Tf func, T... args)
{
    func(args...);
}

// GlobalFunction must be declared before it is launched here
template<typename Tf, typename... T>
__host__ void HostFunction(Tf func, T... args)
{
    GlobalFunction <<< dimGrid, dimBlock >>> (func, args...);
}

I can achieve this goal with a device function pointer, but the problem is that function-pointer calls perform very poorly on the GPU, since the indirect call cannot be inlined. The next thing I tried was a lambda expression.
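For reference, this is roughly how the function-pointer version looks (a minimal sketch; the names d_funcPtr and h_funcPtr and the example body are mine, and error checking is omitted). The address of a __device__ function is only meaningful on the device, so it is stored in a __device__ variable and copied back with cudaMemcpyFromSymbol:

```cuda
using FuncT = void (*)(int*, int*);

__device__ void DeviceFunction(int* arr1, int* arr2)
{
    arr2[threadIdx.x] = arr1[threadIdx.x]; // example body
}

// Take the device function's address on the device side
__device__ FuncT d_funcPtr = DeviceFunction;

template<typename Tf, typename... T>
__global__ void GlobalFunction(Tf func, T... args)
{
    func(args...); // indirect call: the compiler cannot inline this
}

int main()
{
    FuncT h_funcPtr;
    cudaMemcpyFromSymbol(&h_funcPtr, d_funcPtr, sizeof(FuncT));

    int *arr1, *arr2;
    cudaMalloc(&arr1, 32 * sizeof(int));
    cudaMalloc(&arr2, 32 * sizeof(int));

    GlobalFunction<<<1, 32>>>(h_funcPtr, arr1, arr2);
    cudaDeviceSynchronize();
}
```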

// nvcc rejects this: an extended __device__ lambda
// cannot be defined at namespace scope
auto DeviceFunction = [] __device__ (int* arr1, int* arr2)
{
    // Do something...
};

int main()
{
    HostFunction(DeviceFunction, arr1, arr2); // Not working
}

If I define the lambda expression inside a function it works, but I need to define my device function independently, outside of any other function:

int main()
{
    auto DeviceFunction = [] __device__ (int* arr1, int* arr2)
    {
        // Do something...
    };
    HostFunction(DeviceFunction, arr1, arr2); // Working
}

I am wondering whether there are other ways to achieve the above goal with roughly the same performance as calling the device function directly from the global function, i.e. DeviceFunction(args...) instead of func(args...).
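One alternative I have been considering is a functor defined at namespace scope (a sketch only; the names are mine and I have not benchmarked it). Since the call through operator() is resolved at compile time via the template parameter, the compiler should be able to inline it, unlike the function-pointer version:

```cuda
// Functor at namespace scope; operator() plays the role
// of the device function
struct DeviceFunctor
{
    __device__ void operator()(int* arr1, int* arr2) const
    {
        arr2[threadIdx.x] = arr1[threadIdx.x]; // example body
    }
};

template<typename Tf, typename... T>
__global__ void GlobalFunction(Tf func, T... args)
{
    func(args...); // static call through the functor type: inlinable
}

template<typename Tf, typename... T>
__host__ void HostFunction(Tf func, T... args)
{
    GlobalFunction<<<1, 32>>>(func, args...);
}

int main()
{
    int *arr1, *arr2;
    cudaMalloc(&arr1, 32 * sizeof(int));
    cudaMalloc(&arr2, 32 * sizeof(int));

    HostFunction(DeviceFunctor{}, arr1, arr2);
    cudaDeviceSynchronize();
}
```

Is this functor pattern the right direction, or is there a better approach?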