Is it possible to pass stateful functions to a kernel?

Hi everyone!

I’m writing library code and I want to let my users customize kernel behavior by passing in functions. I’ve worked out how to pass function pointers to kernels per Lei Mao’s blog post.

Now, I’m wondering if I can support stateful function objects, such as lambdas and functor structs with non-static data members. I have learned from the CUDA programming guide that nvstd::function instances cannot be used as kernel parameters or be passed from host to device. What else can I do?

Do I have to resort to the void-pointer-to-context pattern, like this?

using UserFunction = bool (*)(float foo, float bar, void* context);

__global__ void kernel(UserFunction user_func, float foo, float bar,
                       void* user_context) {
  user_func(foo, bar, user_context);
}
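For reference, here is a minimal sketch of the host-side plumbing that pattern needs, following the cudaMemcpyFromSymbol trick from Lei Mao's post (Threshold and GreaterThan are made-up examples):

#include <cstdio>

struct Threshold { float cutoff; };  // hypothetical user state

using UserFunction = bool (*)(float, float, void*);

__device__ bool GreaterThan(float foo, float bar, void* context) {
  return foo - bar > static_cast<Threshold*>(context)->cutoff;
}

// The host cannot take the address of a __device__ function directly,
// so stash the pointer in a __device__ variable and copy it back.
__device__ UserFunction d_greater_than = GreaterThan;

__global__ void kernel(UserFunction user_func, float foo, float bar,
                       void* user_context) {
  printf("%d\n", user_func(foo, bar, user_context));
}

int main() {
  UserFunction h_func = nullptr;
  cudaMemcpyFromSymbol(&h_func, d_greater_than, sizeof(h_func));

  Threshold host_state{0.5f};
  Threshold* device_state = nullptr;
  cudaMalloc(&device_state, sizeof(Threshold));
  cudaMemcpy(device_state, &host_state, sizeof(Threshold),
             cudaMemcpyHostToDevice);

  kernel<<<1, 1>>>(h_func, 2.0f, 1.0f, device_state);
  cudaDeviceSynchronize();
  cudaFree(device_state);
}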

You could pass callable objects via a template parameter.

#include <cstdio>

template<class OP>
__global__
void kernel(int a, OP op){
    int result = op(a);
    printf("result is %d\n", result);
}

struct MyOp{
    int i;   // per-object state; copied to the device with the functor
    __host__ __device__
    MyOp(int ii) : i(ii){}
    __device__
    int operator()(int a) const { return a + i; }
};

int main(){
    // Each launch copies the functor, including its state, by value.
    kernel<<<1,1>>>(10, MyOp(1));
    cudaDeviceSynchronize();
    kernel<<<1,1>>>(10, MyOp(10));
    cudaDeviceSynchronize();
}
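The same kernel also works with a stateful lambda if you compile with nvcc's --extended-lambda flag; a quick sketch of what the launch could look like inside main:

// Compile with: nvcc --extended-lambda
int offset = 42;
auto op = [offset] __device__ (int a) { return a + offset; };
kernel<<<1,1>>>(10, op);  // prints "result is 52"
cudaDeviceSynchronize();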

Thanks! And once the stateful user function has been passed into the kernel (i.e., made it onto the device) via a template parameter, could I forward it to a separately compiled device function, possibly via an nvstd::function parameter?

Functors and lambdas can both carry “state” and be passed to kernels. Every C++ lambda expression by definition has a unique closure type (this is not unique or specific to CUDA). I think it should be possible to create a functor with a specific type signature and use that with or without templating, if desired, and/or with nvstd::function.
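To illustrate the unique-type point in plain C++ (nothing CUDA-specific here):

#include <type_traits>

auto f = [](int x) { return x; };
auto g = [](int x) { return x; };  // textually identical, yet...
static_assert(!std::is_same_v<decltype(f), decltype(g)>,
              "each lambda expression has its own closure type");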

If you want to jump through hoops, it may be possible to use nvstd::function in device code with lambdas. As already indicated, some use of templating will probably be necessary.
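For example, something along these lines (a sketch; the nvstd::function object is created and consumed entirely in device code):

#include <nvfunctional>
#include <cstdio>

__global__ void lambda_demo(int a) {
  int i = 5;  // state captured by the lambda
  nvstd::function<int(int)> f = [i](int x) { return x + i; };
  printf("%d\n", f(a));  // prints 15 for a == 10
}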


I experimented a little and found what I suggested to be feasible: passing arbitrary functors into kernels through template parameters, then forwarding the functor to another, separately compiled device function through an nvstd::function parameter. This is shown below.

#include <nvfunctional>

// This could be separately compiled into a static library
// (separate compilation needs relocatable device code, nvcc -rdc=true)
__device__ float ClampedBinaryOp(nvstd::function<float(float, float)> func,
                                 float lhs, float rhs, float lb, float ub) {
  return fminf(fmaxf(func(lhs, rhs), lb), ub);
}

// This would live in a header
template <class OP>
__global__ void Kernel(OP op, size_t len, float const* lhs, float const* rhs,
                       float lb, float ub, float* result) {
  if (auto tid = blockDim.x * blockIdx.x + threadIdx.x; tid < len) {
    // op is implicitly wrapped in an nvstd::function here, in device code,
    // which is where the type erasure happens
    result[tid] = ClampedBinaryOp(op, lhs[tid], rhs[tid], lb, ub);
  }
}
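For completeness, a hypothetical call site (ScaledAdd, Launch, and the buffer names are illustrative, not part of the library):

struct ScaledAdd {
  float scale;  // per-object state travels with the functor
  __device__ float operator()(float a, float b) const {
    return scale * (a + b);
  }
};

void Launch(size_t n, float const* d_lhs, float const* d_rhs,
            float* d_result) {
  int threads = 256;
  int blocks = static_cast<int>((n + threads - 1) / threads);
  Kernel<<<blocks, threads>>>(ScaledAdd{2.0f}, n, d_lhs, d_rhs,
                              -1.0f, 1.0f, d_result);
}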

This appears to satisfy my goal of doing type erasure as early as possible behind a templated interface, much like how the fmt library has the generic fmt::format call down to the separately compiled fmt::vformat for the heavy lifting. Please let me know if my approach is incorrect or inefficient.

Thank you @Robert_Crovella, your StackOverflow post answered several questions I had about nvstd::function. Can I conclude that:

  1. nvstd::function objects (or references to them) cannot be used in the parameter list of a __global__ function,
  2. pointers to nvstd::function objects allocated on the device can be used in the parameter list of a __global__ function,
  3. (the tricky part) the pointed-to nvstd::function must be assigned its target inside a kernel; no cudaMemcpy-family function can do the same thing? (I sketched what I mean below.)
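Concretely, I imagine points 2 and 3 combine into something like this sketch, using placement new to construct the target in device code (unverified on my end):

#include <nvfunctional>
#include <cstdio>
#include <new>

using F = nvstd::function<int(int)>;

__global__ void Assign(F* buf) {
  // Point 3: the target can only be set in device code.
  new (buf) F([](int x) { return x * 2; });
}

__global__ void Use(F* f, int a) {
  printf("%d\n", (*f)(a));  // point 2: pointer-to-nvstd::function parameter
}

int main() {
  F* d_f = nullptr;
  cudaMalloc(&d_f, sizeof(F));
  Assign<<<1, 1>>>(d_f);
  Use<<<1, 1>>>(d_f, 21);  // should print 42
  cudaDeviceSynchronize();
  cudaFree(d_f);
}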