Hi,

I'm interested in using nvstd::function in a project, but the way I intend to use it seems to conflict with the guidelines described in the official CUDA C++ Programming Guide, so I'm hoping to find clarification on whether what I'm doing is admissible or whether it's undefined behavior.

Concretely, my use case looks like:
```cpp
#include <nvfunctional>
#include <iostream>

// my library has a handful of __device__ functions,
// and I'd like to be able to pass these functions around from kernel to kernel
__device__ double foo(double x) { return x * x; }

// This kernel takes a __device__ function as a template argument, and uses it
// to create an nvstd::function of the appropriate signature
template <auto f>
__global__ void create_func(nvstd::function<double(double)>* func) {
  *func = f;
}

// This kernel evaluates the provided nvstd::function and saves the output
__global__ void apply_func(nvstd::function<double(double)>* func, double* output) {
  *output = (*func)(3.0);
}

int main() {
  nvstd::function<double(double)>* func;
  cudaMallocManaged(&func, sizeof(nvstd::function<double(double)>));

  double* output;
  cudaMallocManaged(&output, sizeof(double));

  // create the nvstd::function object
  create_func<foo><<<1, 1>>>(func);
  cudaDeviceSynchronize();

  // and later invoke it in a separate kernel
  apply_func<<<1, 1>>>(func, output);
  cudaDeviceSynchronize();

  std::cout << *output << std::endl;

  cudaFree(func);
  cudaFree(output);
}
```
This example seems to compile and run with both nvcc (12.2) and clang++ (16), it produces the expected output, and compute-sanitizer reports no issues. However, this section of the CUDA C++ Programming Guide mentions:

> nvstd::function cannot be used in the parameter type of a __global__ function, if the __global__ function is launched from host code.
Does this mean that my code is not expected to work, or is it only passing nvstd::function objects by value to __global__ functions that is prohibited?
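For reference, this is the pattern I understand the guide to prohibit, sketched as a hypothetical kernel (not part of my code) that takes the nvstd::function itself as a parameter:

```cpp
#include <nvfunctional>

// My reading: THIS is what the guide forbids -- an nvstd::function appearing
// by value in the parameter list of a host-launched __global__ function. The
// closure object would be constructed on the host and bitwise-copied to the
// device, where its internal callable pointers are meaningless.
__global__ void apply_by_value(nvstd::function<double(double)> f, double* out) {
  *out = f(3.0);  // would call through host-side pointers
}
```

By contrast, my example only passes a *pointer* to an nvstd::function that was both constructed and invoked on the device, so the closure object itself is never copied across the host/device boundary.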
I understand that runtime polymorphism and CUDA kernels are a potential footgun: copying objects between host and device doesn't move the vtable, so calling virtual functions on the wrong side can segfault. But I was also under the impression that if the object is created in a CUDA kernel, its vtable is in device memory, so it can be used in subsequent CUDA kernels.
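For concreteness, this is the analogous virtual-function pattern I had in mind, as a sketch (the class names are made up; storage is assumed to come from cudaMalloc/cudaMallocManaged):

```cpp
#include <new>

struct Base {
  __device__ virtual double eval(double x) const = 0;
};
struct Square : Base {
  __device__ double eval(double x) const override { return x * x; }
};

// Construct the object on the device via placement new, so its vtable
// pointer refers to the device-side vtable...
__global__ void make(void* storage) { new (storage) Square; }

// ...then, my understanding is that later kernels can safely call through
// it, since the vtable pointer was never produced on the host.
__global__ void use(Base* obj, double* out) { *out = obj->eval(3.0); }
```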
Thank you,
Sam