Hello,
This is Mike. I’m a Physics PhD student and am trying to use CUDA for one of my scientific computing research projects.
The core of my problem is as follows:
I have a function double f(double x, double y). I would like to fix y at some constant value so that f becomes a single-variable function, and then pass it to a custom operation such as integral(f, 0, 10), where the function "integral" accepts a single-variable function as an input argument.
In C++11 this is straightforward to implement: I can either use std::bind(f, _1, y) to fix y, or use a lambda [=](double x){ return f(x, y); } to create a closure. Either way I end up with a single-variable function that I can pass to operations such as integral. Like:
double y = 10;
auto f_fixed = [=](double x) { return f(x, y); };
integral(f_fixed, 0, 10);
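For concreteness, here is a minimal host-only sketch of that pattern; the trapezoid-rule integral is just a toy stand-in for my real operation:

#include <cstdio>
#include <functional>

double f(double x, double y) { return x * y; }

// Toy trapezoid-rule integral; the real operation is more involved.
double integral(const std::function<double(double)> &g, double a, double b) {
    const int n = 1000;
    const double h = (b - a) / n;
    double sum = 0.5 * (g(a) + g(b));
    for (int i = 1; i < n; ++i)
        sum += g(a + i * h);
    return sum * h;
}

int main() {
    double y = 10;
    auto f_fixed = [=](double x) { return f(x, y); };
    // Integral of 10*x over [0, 10] is 500.
    printf("%f\n", integral(f_fixed, 0, 10));
    return 0;
}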
However, I ran into some problems when trying to implement this in CUDA (I plan to use CUDA to accelerate the "operation on f" part, e.g. parallelizing the integration). I am currently using CUDA 9.0 on a Titan Xp, and my IDE is Visual Studio 2017.
One way to define the lambda is on the device side (that is, inside a global kernel). But that means that, to perform the integral, I would need to call "integral" as a child kernel from the parent kernel via dynamic parallelism. For example:
#include <nvfunctional>

__device__ float f(float x, float y) {
    return x * y;
}

__global__ void child_launch(float *data, const nvstd::function<float(float)> &f) {
    // Here it simply applies f; in reality one could use data and f
    // to perform the integral in parallel.
    data[threadIdx.x] = f(data[threadIdx.x]);
}

__global__ void parent_launch(float *data) {
    // Just a single serial thread on the device.
    float y = 10;
    __syncthreads();
    if (threadIdx.x == 0) {
        auto f_fixed = [=] __device__ (float x) {
            return f(x, y);
        };
        child_launch<<<1, 256>>>(data, f_fixed);
        cudaDeviceSynchronize();
    }
    __syncthreads();
}
This gives me one of two errors: "cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch" (if f_fixed is passed by value to child_launch), or "a pointer to local memory cannot be passed to a launch as an argument" (if f_fixed is passed by reference, as shown above). In fact, even an ordinary device function cannot be passed into child_launch as an argument; the compiler returns the same "pointer to local memory cannot be passed to a launch as an argument" error.
Does this mean that child kernels cannot accept functions (either user-defined functors or device functions) as their arguments?
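One variation I have been wondering about is templating the child kernel on the functor type and passing the lambda by value, so that no nvstd::function (with its user-provided copy constructor) is involved. This is only a sketch of what I mean; I have not verified it on my setup:

// Requires compilation with -rdc=true for dynamic parallelism.
__device__ float f(float x, float y) {
    return x * y;
}

// Child kernel templated on the functor type; a lambda that only captures
// a float by value is trivially copyable, so it can be passed by value.
template <typename Func>
__global__ void child_launch_t(float *data, Func g) {
    data[threadIdx.x] = g(data[threadIdx.x]);
}

__global__ void parent_launch(float *data) {
    float y = 10;
    if (threadIdx.x == 0) {
        // A lambda defined inside device code is a __device__ lambda by default.
        auto f_fixed = [=](float x) { return f(x, y); };
        child_launch_t<<<1, 256>>>(data, f_fixed);
        cudaDeviceSynchronize();
    }
}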
Alternatively, I could build the lambda (or a std::bind-style closure) in host code, since recent CUDA versions support __device__ lambdas declared inline in host code. But that raises another problem: host code cannot simply access the device function f, while if f is a host function, the on-device kernel will not recognize it once it is passed back to the kernel from the host.
Therefore, may I ask whether it is possible to somehow access a device function f from the host side, wrap it in a closure (either with a lambda or with bind), and then pass it back to a global kernel on the device? (Or, alternatively, to write a host function, wrap it in a lambda, and somehow copy the wrapped function over to the device side?)
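To make the host-side idea concrete, this is roughly the shape I have in mind (assuming nvcc's --expt-extended-lambda flag, which CUDA 9.0 accepts for __device__ lambdas declared in host code; I am not sure whether the lambda body is actually allowed to call the __device__ function f):

// Compile with: nvcc --expt-extended-lambda test.cu
__device__ float f(float x, float y) {
    return x * y;
}

template <typename Func>
__global__ void apply_kernel(float *data, Func g) {
    data[threadIdx.x] = g(data[threadIdx.x]);
}

int main() {
    float *data;
    cudaMalloc(&data, 256 * sizeof(float));
    float y = 10;
    // Extended __device__ lambda declared in host code; the question is
    // whether its body may call the __device__ function f.
    auto f_fixed = [=] __device__ (float x) { return f(x, y); };
    apply_kernel<<<1, 256>>>(data, f_fixed);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}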
Thank you in advance!
Yours Sincerely,
Mike