I am trying to pass a device function and its parameters to another host function so that I can execute it later on the GPU using CUDA. The idea is to write different device functions and reuse the same host and global functions for all of them. For example:
__device__ void DeviceFunction(int* arr1, int* arr2)
{
    // Do something...
}

int main()
{
    HostFunction(DeviceFunction, arr1, arr2); // arr1, arr2 allocated elsewhere
}
Let's assume the host function calls a __global__ function, and the global function, in turn, calls the device function.
template<typename Tf, typename... T>
__global__ void GlobalFunction(Tf func, T... args)
{
    func(args...);
}

template<typename Tf, typename... T>
__host__ void HostFunction(Tf func, T... args)
{
    GlobalFunction<<<dimGrid, dimBlock>>>(func, args...);
}
I can achieve this goal by using a device function pointer, but calls through a function pointer perform poorly on the GPU, because the compiler cannot inline an indirect call. The next thing I tried was a lambda expression.
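For context, the function-pointer variant usually has to take the function's address in device code and copy it back to the host before launching the kernel, roughly like this (a sketch; the symbol name `d_DeviceFunctionPtr` and the trivial launch configuration are my own):

```cuda
#include <cstdio>

__device__ void DeviceFunction(int* arr1, int* arr2)
{
    // Do something...
}

// Function-pointer type matching DeviceFunction's signature.
using DeviceFuncPtr = void (*)(int*, int*);

// The address must be taken in device code; a host-side
// &DeviceFunction would not be a valid device address.
__device__ DeviceFuncPtr d_DeviceFunctionPtr = DeviceFunction;

template<typename Tf, typename... T>
__global__ void GlobalFunction(Tf func, T... args)
{
    func(args...); // indirect call: the compiler cannot inline this
}

int main()
{
    // Copy the device-side pointer value to the host so it can
    // be passed as a kernel argument.
    DeviceFuncPtr h_ptr;
    cudaMemcpyFromSymbol(&h_ptr, d_DeviceFunctionPtr, sizeof(h_ptr));

    int *arr1 = nullptr, *arr2 = nullptr; // allocate with cudaMalloc in real code
    GlobalFunction<<<1, 1>>>(h_ptr, arr1, arr2);
    cudaDeviceSynchronize();
}
```

The indirect call in `GlobalFunction` is exactly where the performance is lost: the callee is only known at run time, so no inlining is possible.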
// Not an extended lambda
auto DeviceFunction = [] __device__ (int* arr1, int* arr2)
{
    // Do something...
};

int main()
{
    HostFunction(DeviceFunction, arr1, arr2); // Not working
}
If I define the lambda expression inside a function, it works (nvcc only accepts extended `__device__` lambdas defined inside a function, not at namespace scope), but I need to define my device function outside of other functions, independently.
int main()
{
    auto DeviceFunction = [] __device__ (int* arr1, int* arr2)
    {
        // Do something...
    };

    HostFunction(DeviceFunction, arr1, arr2); // Working
}
I am wondering whether there are other ways to achieve the above goal with almost the same performance as calling the device function directly from the global function, i.e. using DeviceFunction(args...); instead of func(args...);.
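One alternative that keeps the definition at namespace scope and still allows full inlining is a functor: a struct with a `__device__` `operator()`. Because the functor's concrete type is baked into the kernel at template instantiation, the call is resolved at compile time and can be inlined like a direct call. A sketch under the same assumptions as the snippets above (`dimGrid`, `dimBlock`, `arr1`, `arr2` defined elsewhere):

```cuda
// Functor defined at namespace scope, standing in for a free device function.
struct DeviceFunction
{
    __device__ void operator()(int* arr1, int* arr2) const
    {
        // Do something...
    }
};

template<typename Tf, typename... T>
__global__ void GlobalFunction(Tf func, T... args)
{
    func(args...); // callee known at compile time; eligible for inlining
}

template<typename Tf, typename... T>
__host__ void HostFunction(Tf func, T... args)
{
    GlobalFunction<<<dimGrid, dimBlock>>>(func, args...);
}

int main()
{
    // Pass a functor instance; each new device function is a new struct.
    HostFunction(DeviceFunction{}, arr1, arr2);
}
```

Writing one small struct per device function is slightly more verbose than a free function, but it satisfies the "defined outside of other functions" requirement without the indirect-call penalty of a function pointer.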