Passing lambda functions as arguments to kernels

Hello,

This is Mike. I’m a Physics PhD student and am trying to use CUDA for one of my scientific computing research projects.

The core of my problem is this:

There is a function double f(double x, double y). I’d like to fix y at some constant value, making f a single-variable function, and then pass it to some custom operation such as integral(f, 0, 10), where the function “integral” accepts a single-variable function as an input argument.

In C++11 this is rather simple to implement: I can either use, say, std::bind(f, _1, y) to fix y, or use a lambda [=](double x){ return f(x, y); } to create a closure. Either way, I get a single-variable function that I can use for operations such as integral. Like:

double y = 10;
auto f_fixed = [=](double x){ return f(x, y); };
integral(f_fixed, 0, 10);
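
For concreteness, here is a self-contained host-side version of what I mean; the integral below is just a stand-in midpoint rule, not my real code:

#include <cstdio>
#include <functional>

double f(double x, double y){ return x * y; }

// stand-in "integral": midpoint rule over [a, b]
double integral(std::function<double(double)> g, double a, double b){
	const int n = 1000;
	double h = (b - a) / n, sum = 0;
	for (int i = 0; i < n; i++)
		sum += g(a + (i + 0.5) * h) * h;
	return sum;
}

int main(){
	double y = 10;
	auto f_fixed = [=](double x){ return f(x, y); };
	auto f_bound = std::bind(f, std::placeholders::_1, y);
	// both print 500 (the exact value of the integral of 10*x from 0 to 10)
	printf("%f %f\n", integral(f_fixed, 0, 10), integral(f_bound, 0, 10));
}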

However, I met some problems when trying to implement this in CUDA (I’m planning to use CUDA to accelerate the “operation on f” part, such as parallelizing the integration). I’m currently using CUDA 9.0 on a Titan Xp, and the IDE I’m using is Visual Studio 2017.

One way to be able to define the lambda is to do it on the device side (that is, inside a global kernel). But that means that, to perform the integral, I would need to call “integral” as a child kernel from within the parent kernel, using dynamic parallelism. For example:

#include <nvfunctional>

__device__ float f(float x, float y) {
	return x * y;
}

__global__ void child_launch(float *data, const nvstd::function<float(float)> &f) {
	// here it simply applies f, but in reality one could use data and f
	// to perform the integral in parallel
	data[threadIdx.x] = f(data[threadIdx.x]);
}

__global__ void parent_launch(float *data) {
	// just a serial single thread on the device
	float y = 10;

	__syncthreads();

	if (threadIdx.x == 0) {

		auto f_fixed = [=] __device__ (float x) {
			return f(x, y);
		};

		child_launch<<<1, 256>>>(data, f_fixed);
		cudaDeviceSynchronize();
	}

	__syncthreads();
}

This gives me one of two errors: “cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch” (if f_fixed is passed by value to child_launch) or “a pointer to local memory cannot be passed to a launch as an argument” (if f_fixed is passed by reference, as shown here). In fact, even a plain device function cannot be passed into child_launch as an argument; the compiler returns the same “pointer to local memory cannot be passed to a launch as an argument” error.

Does this mean that child kernels cannot accept functions (either user-defined functors or device functions) as arguments?

Alternatively, another approach would be to build the lambda (or the std::bind equivalent) in host code, since the latest CUDA provides support for device lambdas declared inline in host code. But that brings another problem: host code cannot simply access the device function f, while if f were a host function instead, the kernel on the device wouldn’t be able to recognize it after it’s passed back from the host.

Therefore, may I ask whether it is possible to somehow access a device function f from the host side, wrap it in a closure (either with a lambda or bind), and then pass it back to a global kernel on the device? (Or, alternatively, to write a host function, wrap it in a lambda, and somehow copy the wrapped function to the device side?)

Thank you in advance!

Yours Sincerely,
Mike

You could do it like this, creating the device-side lambda (which calls a device function) on the host:

// compile with, e.g.: nvcc --expt-extended-lambda example.cu
#include <cstdio>

template<class F>
__global__
void kernel(F f){
	f(threadIdx.x);
}

__host__ __device__
void function_hostdevice(int i, int k, const char* s){
	printf("%s, %d %d\n", s, i, k);
}

__device__
void function_onlydevice(int i, int k, const char* s){
	printf("%s, %d %d\n", s, i, k);
}


int main(){
	int k = 13;

	// device lambda defined in host code; it calls the device version
	// of the __host__ __device__ function
	auto func_gpu = [=] __device__ (int i){
		function_hostdevice(i, k, "gpu");
	};

	// ordinary host lambda; it calls the host version of the same function
	auto func_cpu = [=] (int i){
		function_hostdevice(i, k, "cpu");
	};

	for(int i = 0; i < 10; i++)
		func_cpu(i);

	kernel<<<1,32>>>(func_gpu);

	cudaDeviceSynchronize();

	// a device lambda may also call a __device__-only function
	auto func_gpu2 = [=] __device__ (int i){
		function_onlydevice(i, k, "gpu");
	};

	kernel<<<1,32>>>(func_gpu2);

	cudaDeviceSynchronize();
}
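
Applied to your integral example, a minimal sketch could look like the following. The midpoint rule and the atomicAdd reduction are just placeholder choices for brevity (a proper parallel reduction would be faster), and the names here (integrate, n) are my own:

// sketch only; also needs nvcc --expt-extended-lambda
#include <cstdio>

__host__ __device__
float f(float x, float y){
	return x * y;
}

// kernel templated on the callable type, as above; each thread
// evaluates one midpoint sample and accumulates it into *result
template<class F>
__global__
void integrate(F g, float a, float b, int n, float* result){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i < n){
		float h = (b - a) / n;
		float x = a + (i + 0.5f) * h;
		atomicAdd(result, g(x) * h);
	}
}

int main(){
	float y = 10;
	// fix y on the host, inside a device lambda
	auto f_fixed = [=] __device__ (float x){ return f(x, y); };

	float* d_result;
	cudaMalloc(&d_result, sizeof(float));
	cudaMemset(d_result, 0, sizeof(float));

	int n = 1 << 20;
	integrate<<<(n + 255) / 256, 256>>>(f_fixed, 0.0f, 10.0f, n, d_result);

	float result;
	cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
	printf("integral = %f (exact value: 500)\n", result);
	cudaFree(d_result);
}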

Many thanks for striker159’s very timely and kind reply!

That sounds great! Indeed, I didn’t know that a host device function automatically resolves to its host or device version depending on whether it’s called from a device lambda, nor that a device lambda can call device functions. I guess as a beginner I’ll have a lot to learn!

I’ll try it out right away, and if this works it will already satisfy my needs for the project (rather than having to use nested kernels) :D Thanks again for the help!

Also, just out of curiosity, may I ask whether kernels launched inside other kernels (i.e. child kernels in dynamic parallelism) can accept function pointers and lambdas? (I think I’ll give the template method you suggested a try, rather than nvstd’s implementation, to see whether the child kernel accepts a lambda this time.)

Just a quick follow-up:

  1. Yes, simply using a device lambda to call a host device function in the host code works great! (That’s probably because the body of the device lambda is treated as “device” code, even though it appears inside the host main function, so it can indeed call device functions, and a host device function automatically picks its device version there.)

  2. I’ve also tried the case with child kernels. It seems that the parent kernel can pass either a device function or a lambda defined inside the parent kernel to the child kernel. For instance:

__host__ __device__ float f(float x) {
	return x * x;
}

template<class F>
__global__ void child_launch(float *data, F f) {
	if (threadIdx.x < 10)
		data[threadIdx.x] = f(data[threadIdx.x]);
}

__global__ void parent_launch(float *data) {

	__syncthreads();

	if (threadIdx.x == 0) {

		auto lambda = [=] __device__ (float x) {
			return f(x);
		};

		child_launch<<<1, 32>>>(data, lambda);
		//child_launch<<<1, 32>>>(data, f);
		cudaDeviceSynchronize();
	}

	__syncthreads();
}

This is probably thanks to striker159’s kind suggestion of using a template parameter (rather than the nvstd::function implementation) to pass the function as an argument; and yes, out of curiosity I checked, and the same thing works with child kernels launched from parent kernels too. A minimal host-side driver for testing this is below.
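
For completeness, a host-side driver for this parent/child example might look like the following (the array contents and sizes are arbitrary choices of mine; note that dynamic parallelism needs relocatable device code):

// hypothetical driver for the parent/child example above;
// compile with, e.g.: nvcc -rdc=true example.cu -lcudadevrt
#include <cstdio>

int main(){
	float h_data[32];
	for (int i = 0; i < 32; i++)
		h_data[i] = (float)i;

	float *d_data;
	cudaMalloc(&d_data, sizeof(h_data));
	cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

	// a single parent thread, which launches the 32-thread child kernel
	parent_launch<<<1, 1>>>(d_data);
	cudaDeviceSynchronize();

	cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
	for (int i = 0; i < 10; i++)
		printf("%f ", h_data[i]); // expect i*i for the first 10 entries
	printf("\n");

	cudaFree(d_data);
}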

Many thanks again for striker159’s kind help!

Mike