Constructing std::function with a __global__ kernel

I am new to CUDA programming.
I have a __global__ kernel function:

extern __global__ void cube(float * d_out, float * d_in){
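	// Global thread index; the original blockIdx.x + threadIdx.x is only correct for a single block.
	int id = blockIdx.x * blockDim.x + threadIdx.x;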
	float f = d_in[id];
	d_out[id] = f*f*f;
}

I am trying to create a std::function object from this __global__ square function.

std::function<void(float*, float*)> kernel = square;

When I try to run this kernel on the GPU by calling

kernel<<<1, NUM_THREADS>>>(d_out, d_in);

it gives the error: a host function call cannot be configured

If I use nvstd::function instead, it gives the error: a device function call cannot be configured

What is the difference between the nvstd::function implementation and std::function?
Also, how can I store a __global__ function as a function object? Why does it work if I store it in a function pointer instead?

nvstd::function is designed to be used with either lambdas or functions decorated with __device__.

A kernel is neither of those.

https://devblogs.nvidia.com/new-compiler-features-cuda-8/
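
As a rough sketch of the intended use, nvstd::function can wrap a __device__ function inside device code, while the kernel itself is still launched the ordinary way. The names cube_value, cube_via_nvstd and run_cube_via_nvstd are made up for this example, and error checking is left out for brevity.

```cuda
#include <nvfunctional>   // nvstd::function
#include <cuda_runtime.h>
#include <vector>

// A plain __device__ function: one of the callables nvstd::function accepts.
__device__ float cube_value(float x) { return x * x * x; }

// The wrapper is created and called in device code, so it holds a device
// function address. The kernel itself is still an ordinary __global__ function.
__global__ void cube_via_nvstd(float* d_out, const float* d_in, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        nvstd::function<float(float)> f = cube_value;
        d_out[id] = f(d_in[id]);
    }
}

// Hypothetical host helper: copies the input over, launches, copies back.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_cube_via_nvstd(const std::vector<float>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size());
    if (n == 0) return out;

    const size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    cube_via_nvstd<<<blocks, threads>>>(d_out, d_in, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Note that the thing being configured with <<<...>>> is still the kernel, never the nvstd::function object.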

The function capture point determines where it can be used. A function captured in host code will have a host function address. A function captured in device code will have a device function address.
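
To illustrate the host side, here is a minimal sketch of two things that should work: launching through a plain function pointer taken in host code, and storing a host lambda that performs the launch in a std::function. The std::function is then called like any host function rather than configured with <<<...>>>. The helper cube_on_gpu and its via_std_function flag are invented for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <functional>
#include <vector>

__global__ void cube(float* d_out, const float* d_in, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        float f = d_in[id];
        d_out[id] = f * f * f;
    }
}

// Hypothetical helper: cubes every element on the GPU, launching either
// through a function pointer or through a std::function holding a host lambda.
std::vector<float> cube_on_gpu(const std::vector<float>& in, bool via_std_function) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size());
    if (n == 0) return out;

    const size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    if (via_std_function) {
        // std::function cannot be launched itself, but it can hold a host
        // lambda whose body contains the configured kernel launch.
        std::function<void(float*, const float*, int)> launch =
            [blocks, threads](float* o, const float* i, int count) {
                cube<<<blocks, threads>>>(o, i, count);
            };
        launch(d_out, d_in, n);   // ordinary host call, no <<<...>>> here
    } else {
        // A pointer to the kernel, taken in host code, can be configured directly.
        void (*kernel_ptr)(float*, const float*, int) = cube;
        kernel_ptr<<<blocks, threads>>>(d_out, d_in, n);
    }

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling cube_on_gpu(values, false) takes the function pointer route and cube_on_gpu(values, true) takes the std::function route; both are expected to return the same cubed values.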

On top of all this, you’ve given a function definition for cube but are capturing a function called square, which you have not defined.