Force nested inlining to avoid redundant function calls

Hello!
I am not sure where to ask general questions about CUDA, so I haven't found anything better than the cuda-gdb topic.

I am trying to write a simple visitor pattern to avoid repeating the same code several times, like this:

#include <nvfunctional>

__device__ inline void visit(int* data, int n, nvstd::function<int(int)> callback) {
    for (int i = 0; i < n; ++i) {
        data[i] = callback(data[i]);
    }
}

__global__ void square(int* array, int n, int m) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    visit(&array[tid * m], m, [] __device__ (int x) -> int {
        return x*x;
    });
}

__global__ void cube(int* array, int n, int m) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    visit(&array[tid * m], m, [] __device__ (int x) -> int {
        return x*x*x;
    });
}

This is just a simple example; the real code is a little more complex, but the idea is the same: there is one visitor function that selects some elements and calls a callback, which does some work and can differ between invocations of the visitor. The problem is that a callback call per data sample is too much overhead. I expect the visitor function to be inlined into the calling kernel, and the lambda to be inlined into the visitor, like this:

__global__ void square(int* array, int n, int m) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    // visitor and lambda are inlined
    int* data = &array[tid * m];
    for (int i = 0; i < m; ++i) {
        data[i] = data[i] * data[i];
    }
}

I could write this manually, but that would mean repeating the visitor part, which in the real application is more complex than a single for loop. And with the approach above, performance is extremely low because of the additional function calls; I checked on Compiler Explorer that the example above really does emit the calls. I tried to avoid them by adding __forceinline__ to the visitor, but there seems to be no way to force inlining of the lambda. Is it impossible for the compiler to perform this nested inlining, or is there some way to achieve it?

You could pass the lambda directly to a function template, without the intermediate nvstd::function.

template<class Func>
__device__ 
void visit(int* data, int n, Func callback) {
    for (int i = 0; i < n; ++i) {
        data[i] = callback(data[i]);
    }
}
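
For example (untested sketch), the square kernel can stay exactly as you wrote it. Func is deduced as the lambda's closure type, so the compiler has full visibility into the call and should be able to inline both visit and the lambda:

__global__ void square(int* array, int n, int m) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    // Func deduces to the closure type of this __device__ lambda,
    // so no type-erased call through nvstd::function is involved.
    visit(&array[tid * m], m, [] __device__ (int x) -> int {
        return x * x;
    });
}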

Thank you, that really solves my problem! It looks like a raw function pointer (int(*)(int)) works as well, and the Func template parameter resolves to it (at least in this toy example; in my project I got some compile errors when using raw function pointers, but maybe I just made some silly mistakes). And it seems that a function cast to nvstd::function gets inlined into the function object, but the nvstd::function call itself is not inlined, or something like that.
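
For reference, the raw-function-pointer variant I mean looks roughly like this (sq and square_fp are just illustrative names):

__device__ int sq(int x) { return x * x; }

__global__ void square_fp(int* array, int n, int m) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    // Passing a __device__ function's address from device code is legal;
    // here Func is deduced as int(*)(int).
    visit(&array[tid * m], m, sq);
}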
