CUDA kernel is slow with function pointers

I am on the Jetson Nano with CUDA 10. I have functions which iterate through pixels and modify them to give me filtered versions of an image. I have a regular host version of the code with the following functions:

void processFilterBlue(uchar *);
void processFilterGreen(uchar *);
void processFilterRed(uchar *);

In my host code, I just have a nested for loop which calls the functions through a function pointer.

void processFilter(filter_func_t filter_function, cv::Mat img) {
    for(int i = 0; i < rows; ++i) {
        for(int j = 0; j < cols; ++j) {
            get pixel location;
            filter_function(pixel location); // where filter_function is one of the 3 functions above
        }
    }
}

I wanted to repeat this behavior with CUDA using the exact same three filter functions. I declared them with __host__ __device__ preceding their declarations and used the following stack overflow answer to get the same behavior with my kernel: cuda - How to pass a pointer to a device function as an argument to a kernel function? - Stack Overflow
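
The approach from that answer boils down to something like this (a rough sketch, not my exact code; filter_func_t is just a typedef for the function pointer type):

typedef void (*filter_func_t)(uchar *);

// take the address of the __device__ function on the device side
__device__ filter_func_t d_blue_func = processFilterBlue;

// host side: copy that pointer back and hand it to the kernel
filter_func_t h_func;
cudaMemcpyFromSymbol(&h_func, d_blue_func, sizeof(filter_func_t));
kernel<<<1, 1>>>(h_func, rows, cols, data, step);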

It seems like this method really impacts the performance of the kernel. Using the CUDA Event API, I found that it was about 400 to 500 milliseconds slower: it runs around 700 ms with function pointers, while the version without function pointers takes about 200 ms. (Note: I haven't actually optimized the kernel yet. I am calling the kernel with <<<1, 1>>>; I wanted to get everything else in place before attempting to use the kernel effectively.) I tested this by copying the function into my .cu code and calling that copied function directly in my kernel. From this post, it seems that nvcc does aggressive inlining, which explains why function pointers are slow compared to explicit function calls. 500 ms is pretty bad, though. That post is 7 years old now, so I was wondering if there's anything I can do here that would narrow the performance gap. I really want to keep the function pointers so that I don't have to maintain three nearly identical kernels alongside their regular host counterparts.
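
For reference, the timing is just the usual CUDA event pattern, something like:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<1, 1>>>(h_func, rows, cols, data, step);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds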

If that's not an option, is there some optimization for linked functions? My three filter functions are in a separate .h and .cu file which are compiled into an object file; calling the functions that way gives a similar slowdown. Is there perhaps something in the compilation process that can be optimized? I compile my two .cu files with -dc, then I do a device link with -dlink, and lastly I link the resulting object files with g++.

You could templatize the kernel using color as a template parameter, and then instantiate that three times, once per color. You could then invoke the entire kernel via a function pointer.

That's not always the best strategy when porting code to the GPU, as you are discovering. It's a reasonable starting point, of course, but such a "mechanical" port as a starting point may result in performance issues that then need to be addressed.

Yes, it certainly can. My guess in this case, based on the timing you’ve presented, is that your functions as tested are relatively simple. There is going to be a more-or-less fixed overhead for using your device-linked function-pointer method, and if the function you are calling is relatively trivial, and even worse you are calling that function many times over from the same kernel, you’re going to make the problem appear about as bad as it can get. On the other hand, if each of your functions that you are calling via device function pointers was itself using ~500 milliseconds per call, then I think the overhead would be negligible, in terms of overall percentage performance impact. But my guess is you have relatively trivial functions, and the percentage impact of the overhead is therefore large.

From a performance perspective, the best suggestion I can offer is to get those functions into the same compilation unit as your kernel and eliminate the relocatable device code linking step. If you still want a "modicum of modularity" you can still define your color functions in separate files (either .cu or .cuh, it does not matter) and just #include those in the file that contains the kernel. (That means only including those files in other source files; do not also add them to the project or invoke the compiler on them separately.)
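
A minimal sketch of what I mean (hypothetical file names):

// filters.cuh -- contains the definitions, not just declarations
__host__ __device__ void processFilterBlue(uchar *bgr)  { /* ... */ }
__host__ __device__ void processFilterGreen(uchar *bgr) { /* ... */ }
__host__ __device__ void processFilterRed(uchar *bgr)   { /* ... */ }

// kernel.cu -- the only file handed to the CUDA compiler for this code
#include "filters.cuh"   // pulls the definitions into this compilation unit

__global__ void kernel(int rows, int cols, uchar *data, int step)
{
    // ... calls the filter functions directly, so they can be inlined
}

// build: no -dc / -dlink needed for this part any more, e.g.
// nvcc -c kernel.cu -o kernel.o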

Not in CUDA 10.

The latest CUDA versions (e.g. CUDA 11.1) have some link-time-optimization features. The function pointer method might reduce the ability of the compiler to do link time optimization, however, so I would suggest it is something that needs to be tested to see if it provides any benefit for your specific use-case.
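
If you do try that on a newer toolkit, the device LTO flow looks roughly like the following (hypothetical file names; check the nvcc documentation for your toolkit version, but IIRC the relevant flag is -dlto):

$ nvcc -dc -dlto kernel.cu -o kernel.o
$ nvcc -dc -dlto filters.cu -o filters.o
$ nvcc -dlto -dlink kernel.o filters.o -o dlink.o
$ g++ kernel.o filters.o dlink.o -L/usr/local/cuda/lib64 -lcudart -o app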

I suppose it may be worthwhile to point out that some additional inefficiency may arise in your approach when you proceed beyond a <<<1,1>>> kernel configuration. You haven't posted valid C++ code (to wit: get pixel location;), but I imagine there is a possibility here that you may process colors RGBRGBRGB… etc. across threads. The function pointer approach may also not play very well with that pattern across threads in a warp. But it's not possible for me to say how much of a concern that may be without actually testing code. The "cost" may already be evident/inherent in your current measurement.

So the actual function looks like this:

// filter_func_t is: typedef void (*filter_func_t)(uchar *);
__global__
void kernel(filter_func_t filter_function, int rows, int cols, uchar *data, int step)
{
    for(int i = 0; i < rows; ++i) {
        uchar *data_ptr = data + (i * step);      // start of row i

        for(int j = 0; j < cols; ++j) {
            int pixel_loc = j * 3;                // 3 bytes per BGR pixel
            filter_function(data_ptr + pixel_loc);
        }
    }
}

The filter functions are relatively simple, with slight variation between the three:

__host__ __device__
void processFilterBlue(uchar *bgr)
{
    uchar new_pix_val;

    // blue minus red, clamped at zero
    int b = (int) bgr[0] - (int) bgr[2];
    if(b < 0)
        new_pix_val = 0;
    else
        new_pix_val = (uchar) b;

    bgr[0] = new_pix_val;
    bgr[1] = new_pix_val / 2;
    bgr[2] = new_pix_val / 2;
}

Would a different kernel configuration cause issues with this sort of code? (This is also my first CUDA code, so I'm not aware of these things yet.)

I guess I could just go with that approach. But it sounds like I'll still need three kernels, each calling its respective function? I guess it's not a terrible tradeoff. Although my header file containing the three functions would still have __host__ __device__ in it. So would I put #ifdef __NVCC__ before the qualifiers, or is it some other macro?
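
Something like this is what I had in mind for the header, though I'm not sure whether __NVCC__ or __CUDACC__ is the right macro to check:

// in the shared header; HD expands to nothing when a plain host compiler sees it
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

HD void processFilterBlue(uchar *bgr);
HD void processFilterGreen(uchar *bgr);
HD void processFilterRed(uchar *bgr);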

Could you give an example? I’m not sure what you mean.

Here’s a sample code covering a few of the ideas discussed:

$ cat colorfuncs.cu

//enum color {R, G, B};   // defined in t84.cu before this file is #include'd

template <color C>
__host__ __device__ void filter_function(unsigned char *bgr){
  unsigned char new_pix_val = 0;
  int p = 0;
  switch(C){
          case B:
            p = (int) bgr[0] - (int) bgr[2];
            if(p < 0)
              new_pix_val = 0;
            else
              new_pix_val = (unsigned char) p;
            bgr[0] = new_pix_val;
            bgr[1] = new_pix_val / 2;
            bgr[2] = new_pix_val / 2;
            break;
          case G:
          case R:
          default:
            break;
  }
}
$ cat t84.cu
#include <iostream>
enum color {R, G, B};
#include "colorfuncs.cu"

template <color C>
__global__ void kernel(int rows, int cols, unsigned char *data, int step){
  for (int idy = threadIdx.y+blockDim.y*blockIdx.y; idy < rows; idy += gridDim.y*blockDim.y){
    unsigned char *data_ptr = data + (idy * step);
    for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < cols; idx += gridDim.x*blockDim.x){
      int pixel_loc = idx * 3;
      filter_function<C>(data_ptr + pixel_loc);
    }
  }
}


void kernel_caller(color C, int width, int height, unsigned char *data){
  dim3 grid((width+31)/32, (height+31)/32);
  dim3 block(32, 32);
  if (C == R)
    kernel<R><<<grid, block>>>(height, width, data, width*3);
  if (C == G)
    kernel<G><<<grid, block>>>(height, width, data, width*3);
  if (C == B)
    kernel<B><<<grid, block>>>(height, width, data, width*3);
 }


int main(){
  const int imw = 1024;
  const int imh = 768;
  unsigned char *d;
  cudaMalloc(&d, imw*imh*sizeof(d[0])*3);
  kernel_caller(B, imw, imh, d);
  cudaDeviceSynchronize();
}

$ nvcc -o t84 t84.cu
$ cuda-memcheck ./t84
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t84
==23116== NVPROF is profiling process 23116, command: ./t84
==23116== Profiling application: ./t84
==23116== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  78.497us         1  78.497us  78.497us  78.497us  void kernel<color=2>(int, int, unsigned char*, int)
      API calls:   99.04%  144.05ms         1  144.05ms  144.05ms  144.05ms  cudaMalloc
                    0.67%  973.09us       202  4.8170us     176ns  227.70us  cuDeviceGetAttribute
                    0.13%  183.95us         2  91.976us  80.894us  103.06us  cuDeviceTotalMem
                    0.07%  101.20us         2  50.597us  50.251us  50.944us  cuDeviceGetName
                    0.06%  83.445us         1  83.445us  83.445us  83.445us  cudaDeviceSynchronize
                    0.02%  31.197us         1  31.197us  31.197us  31.197us  cudaLaunchKernel
                    0.01%  9.4460us         2  4.7230us  1.8450us  7.6010us  cuDeviceGetPCIBusId
                    0.00%  6.2490us         4  1.5620us     214ns  5.5830us  cuDeviceGet
                    0.00%  1.2810us         3     427ns     241ns     661ns  cuDeviceGetCount
                    0.00%     581ns         2     290ns     256ns     325ns  cuDeviceGetUuid
$

The kernel in the above code is designed to run as a 2D grid-stride loop which means you can choose more-or-less arbitrary dimensions for the grid, including running on a single thread if you wish. When I run it on a single thread on my GTX960 the kernel takes ~188ms, pretty close to your ~200ms measurement report.

I see that Robert Crovella already posted example code. Since I had already prepared example code but didn't get around to posting it until now, I'll post it even though it is largely redundant. In this example, a templated kernel (__global__ function) operates on an array in one of three different ways. An array of three function pointers is initialized so that each pointer points to one of the three possible instantiations of the kernel. The three operations are then applied in turn to an array of numerical data by iterating over these function pointers.

    #include <cstdio>
    #include <cstdlib>

    #define N (5)
    typedef void (*funcptr)(float *, int);
    typedef enum {OP_SQRT, OP_SQR, OP_LOG, LAST} op_kind;

    template<op_kind kind>
    __global__ void apply_op (float* arr, int len)
    {
        int stride = gridDim.x * blockDim.x;
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        for (int i = tid; i < len; i += stride) {
            if (kind == OP_SQRT) {
                arr[i] = sqrtf (arr[i]);
            } else if (kind == OP_SQR) {
                arr[i] = arr[i] * arr[i];
            } else if (kind == OP_LOG) {
                arr[i] = logf (arr[i]);
            }
        }
    }

    int main (void)
    {
        float arr_host[N];
        float *arr_dev = 0;
        funcptr func[3] = {&apply_op<OP_SQRT>, &apply_op<OP_SQR>, &apply_op<OP_LOG>};
        cudaMalloc ((void **)&arr_dev, sizeof(arr_dev[0])*N);
        for (int i = 0; i < N; i++) {
            arr_host[i] = i;
        }

        printf ("arr = ");
        for (int j = 0; j < N; j++) {
            printf ("% 15.8e  ", arr_host[j]);
        }
        printf ("\n");

        cudaMemcpy (arr_dev, arr_host, sizeof(arr_dev[0])*N, cudaMemcpyHostToDevice);
        
        for (int i = OP_SQRT; i != LAST; i++) {
            func[i]<<<1,1>>>(arr_dev, N);
            cudaMemcpy (arr_host, arr_dev, sizeof(arr_host[0])*N, cudaMemcpyDeviceToHost);
            printf ("arr = ");
            for (int j = 0; j < N; j++) {
                printf ("% 15.8e  ", arr_host[j]);
            }
            printf ("\n");
        }
        cudaFree (arr_dev);
        return EXIT_SUCCESS;
    }

The output should look something like this:

arr =  0.00000000e+000   1.00000000e+000   2.00000000e+000   3.00000000e+000   4.00000000e+000
arr =  0.00000000e+000   1.00000000e+000   1.41421354e+000   1.73205078e+000   2.00000000e+000
arr =  0.00000000e+000   1.00000000e+000   1.99999988e+000   3.00000000e+000   4.00000000e+000
arr = -1.#INF0000e+000   0.00000000e+000   6.93147123e-001   1.09861231e+000   1.38629436e+000

Thanks for the help. I’m getting the performance to be about 200ms now without having optimized the kernel.

One last question, though. Isn't the templating unnecessary? Since we're using a switch statement, wouldn't a regular parameter work just as well? Or is there some optimization in templating that I'm missing?

Templating allows the switch statement to be removed at compile time. It ends up being something like a jump table, but the jump table can be eliminated at compile time. This may also allow the compiler to make other kinds of inlining optimizations once it knows for sure what the jump target will be and what the code beyond it is. Without that, the compiler has to acknowledge that there are (in this case) potentially 3 different jump targets that have to be supported. How important that is depends on your code, but it's not zero benefit.

By "parameter" I assume you mean a compile-time visible parameter. AFAIK, if you used a compile-time visible parameter to guide the jump table, and you had only one instance of your kernel usage to account for, then the compiler could probably do a similar optimization. However the compiler, without the mechanism of templating, will not/cannot (AFAIK) generate 3 different instances of the same kernel. Therefore even if you had compile-time visible parameters, if you intended to call the kernel in different situations with different parameters in the same module, I know of no way for the compiler to avoid the run-time parameter check. Templating allows the compiler (in this case) to generate 3 different kernels, with 3 different entry points, each of which can be optimized for a particular path.
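
As a minimal sketch of the difference (trivial, hypothetical filter body; same color enum as above):

// Runtime parameter: one kernel instance must handle all three values,
// so the branch on c survives into the generated code.
__global__ void filter_runtime(color c, unsigned char *bgr, int n){
  for (int i = threadIdx.x + blockDim.x*blockIdx.x; i < n; i += gridDim.x*blockDim.x)
    if (c == B) bgr[i*3] /= 2;          // hypothetical filter work
}

// Template parameter: the compiler emits a separate kernel per color,
// and within filter_templated<B> the "if" is folded away at compile time.
template <color C>
__global__ void filter_templated(unsigned char *bgr, int n){
  for (int i = threadIdx.x + blockDim.x*blockIdx.x; i < n; i += gridDim.x*blockDim.x)
    if (C == B) bgr[i*3] /= 2;          // hypothetical filter work
}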

Also note that a compile-time visible parameter in host code is not necessarily going to have any effect on the device code compiler behavior. They are pretty much separate tools.

I am not a compiler engineer, so this is really on the edge of my knowledge. Others may wish to comment or know better. If it’s important, you have all the tools at your disposal to inspect compiler behavior, and draw your own conclusions.


So I was experimenting with your solution and hadn't done all the grid and block configuration in the kernel yet. I finally got around to it and the code stopped working. I was getting an error from OpenCV: "CUDA driver version is insufficient for CUDA runtime version." Based on other things I have seen on this forum, this error doesn't really make sense here. The Jetson Nano comes pre-installed with CUDA 10, so I'm assuming everything with the installation is fine. I even reverted my code to remove all the grid changes and I still had the same error. I then thought maybe it was a weird bug, so I rebooted the Nano. Now I can't seem to get it to boot. I tried two monitors; the device starts up and sometimes I see some quick kernel messages, and then it gets stuck. I don't know if the HDMI port broke or what, but the display sort of flashes like it's recognized and then not recognized. This was with two different HDMI cables on two different monitors, so I don't think it's the cable. Any thoughts on what happened?

Maybe sudo apt upgrade caused an issue. I was running that in the background and wasn't paying attention to it. I didn't bother checking whether it had finished, but I would have figured a clean reboot via sudo reboot wouldn't really affect that. Let me know if it would be better to ask this on the Jetson Nano forums.

EDIT:
The messages I saw one time were:

tegradc: tegradc.1: dpd enable lookup fail:-19
cp: not writing through dangling symlink 'etc/resolv.conf'
cgroup:cgroup2: unknown option "nsdelegate"
//i think the rest are normal
using random self ethernet address
using random self ethernet address
random: crng init done
random: 7 urandom warning(s) missed due to ratelimiting
using random self ethernet address
using random host ethernet address

I think that is likely.