Now I’m wondering whether I can support stateful function objects, such as lambdas and functor structs with non-static data members. According to the CUDA programming guide, nvstd::function cannot be used as a kernel parameter nor be passed between host and device. What else can I do?
Do I have to resort to the void-pointer-to-context pattern, like
using UserFunction = bool (*)(float foo, float bar, void* context);

__global__ void kernel(UserFunction user_func, void* user_context) {
    // foo and bar would be computed per thread; the user's state hides behind user_context
    user_func(foo, bar, user_context);
}
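For concreteness, the user-side half of that pattern might look roughly like this (ThresholdContext and ExceedsThreshold are made-up names, and getting the __device__ function pointer over to the host would additionally need something like cudaMemcpyFromSymbol):
struct ThresholdContext { float threshold; };

__device__ bool ExceedsThreshold(float foo, float bar, void* context) {
    auto* ctx = static_cast<ThresholdContext*>(context);
    return foo + bar > ctx->threshold;
}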
You could pass callable objects via template parameter.
#include <cstdio>

// The functor type is a template parameter, so any callable with a matching
// signature, including one that carries state, can be passed to the kernel by value.
template<class OP>
__global__
void kernel(int a, OP op){
    int result = op(a);
    printf("result is %d\n", result);
}

// A stateful functor: the data member i travels to the device with the object.
struct MyOp{
    int i;

    __host__ __device__
    MyOp(int ii) : i(ii){}

    __device__
    int operator()(int a) const { return a + i; }
};

int main(){
    kernel<<<1,1>>>(10, MyOp(1));   // prints "result is 11"
    cudaDeviceSynchronize();
    kernel<<<1,1>>>(10, MyOp(10));  // prints "result is 20"
    cudaDeviceSynchronize();
}
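The same mechanism also works for a stateful lambda, provided it is an extended device lambda. A minimal sketch, assuming the kernel above and compilation with nvcc --extended-lambda:
    int offset = 42;
    // The capture [offset] is the lambda's state; the closure object is passed
    // to the kernel by value, exactly like MyOp.
    kernel<<<1,1>>>(10, [offset] __device__ (int a) { return a + offset; });
    cudaDeviceSynchronize();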
Thanks! And once the stateful user function has been passed into the kernel via a template parameter (i.e., has made it onto the device), could I forward it to a separately compiled device function, possibly through an nvstd::function parameter?
Functors and lambdas can both carry “state” and be passed to kernels. Every C++ lambda by definition has a unique type (this is not unique or specific to CUDA). It should be possible to give a functor a specific type signature and use it with or without templating, as desired, and/or with nvstd::function.
If you want to jump through hoops, it may also be possible to use nvstd::function with lambdas in device code. As already indicated, some use of templating will probably be necessary.
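For instance, something along these lines should work as a sketch (names are mine): the lambda is defined in device code, so an nvstd::function can be constructed from it and invoked there; it just cannot cross the host/device boundary.
#include <cstdio>
#include <nvfunctional>

__global__ void wrap_lambda(int offset) {
    // nvstd::function constructed and used entirely within device code
    nvstd::function<int(int)> f = [offset](int x) { return x + offset; };
    printf("f(5) = %d\n", f(5));
}

int main() {
    wrap_lambda<<<1,1>>>(3);   // prints "f(5) = 8"
    cudaDeviceSynchronize();
}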
I experimented a little and found that what I suggested is feasible: passing arbitrary functors into kernels through template parameters, then forwarding the functor to another, separately compiled device function through an nvstd::function parameter. This is shown below.
#include <nvfunctional>

// This could be separately compiled into a static library
// (separate compilation of device code requires nvcc -rdc=true).
__device__ float ClampedBinaryOp(nvstd::function<float(float, float)> func,
                                 float lhs, float rhs, float lb, float ub) {
    return min(max(func(lhs, rhs), lb), ub);
}

// This would live in a header
template <class OP>
__global__ void Kernel(OP op, size_t len, float const* lhs, float const* rhs,
                       float lb, float ub, float* result) {
    if (auto tid = blockDim.x * blockIdx.x + threadIdx.x; tid < len) {
        result[tid] = ClampedBinaryOp(op, lhs[tid], rhs[tid], lb, ub);
    }
}
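For context, the host side would drive it roughly like this (SaturatingAdd and the launch configuration are just placeholders of mine):
struct SaturatingAdd {
    float scale;
    __device__ float operator()(float a, float b) const { return scale * (a + b); }
};

// lhs, rhs, result are device pointers of length len, allocated elsewhere.
void LaunchClampedAdd(size_t len, float const* lhs, float const* rhs,
                      float* result, cudaStream_t stream) {
    unsigned block = 256;
    unsigned grid = static_cast<unsigned>((len + block - 1) / block);
    Kernel<<<grid, block, 0, stream>>>(SaturatingAdd{2.0f}, len, lhs, rhs,
                                       0.0f, 1.0f, result);
}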
This appears to satisfy my intention of performing type erasure as early as possible behind a templated interface, much like how the renowned fmt library has the generic fmt::format forward to the separately compiled fmt::vformat for the heavy lifting. Please let me know if my approach is incorrect or inefficient.
Thank you @Robert_Crovella, your StackOverflow post answered several questions I had about nvstd::function. Can I conclude that:
nvstd::function objects, and references to them, cannot be used as parameters of a __global__ function;
a pointer to an nvstd::function that lives in device memory can be used as a parameter of a __global__ function;
(the tricky part) the pointed-to nvstd::function must be assigned its target inside a kernel, and no cudaMemcpy-family call can accomplish that from the host?
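If I have understood those three points correctly, the pattern would look roughly like the following sketch of mine (placement new in a setup kernel is my own choice, untested):
#include <cstdio>
#include <new>
#include <nvfunctional>

// Construct the nvstd::function and assign its target in device code (point 3).
__global__ void SetTarget(nvstd::function<int(int)>* f, int offset) {
    new (f) nvstd::function<int(int)>([offset](int x) { return x + offset; });
}

// Only a plain pointer to the device-resident object is a kernel parameter (point 2).
__global__ void UseTarget(nvstd::function<int(int)>* f, int a) {
    printf("result is %d\n", (*f)(a));
}

int main() {
    nvstd::function<int(int)>* f = nullptr;
    cudaMalloc(&f, sizeof(*f));
    SetTarget<<<1, 1>>>(f, 7);
    UseTarget<<<1, 1>>>(f, 10);   // expected to print "result is 17"
    cudaDeviceSynchronize();
    cudaFree(f);
}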