Efficiency problem when calling a virtual function from within another virtual function

On the GPU, if a virtual function with the __device__ prefix calls another virtual function, is the performance cost very high? I found in my program that when I commented out a call to a virtual function with an empty body, performance improved by about 30% (78 ms was reduced to 53 ms). Does this mean the virtual function call itself took 25 ms?
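(Presumably the pattern being asked about looks roughly like the sketch below; the class and function names here are hypothetical, not taken from the questioner's actual code.)

// Hypothetical sketch of the scenario described in the question:
// a __device__ virtual function calling another, empty, virtual function.
class Base {
  public:
    virtual __device__ void outer(int *p) {};
    virtual __device__ void inner(int *p) {};   // empty body
};

class Derived : public Base {
  public:
    __device__ void outer(int *p) {
      *p += 1;
      inner(p);   // the nested virtual call that was commented out
    }
    __device__ void inner(int *p) {}
};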

I don’t see any evidence of that:

$ cat t1857.cu
class A{
  public:
    virtual __device__ void f1(int *){};
    virtual __device__ void f2(int *){};
};

class C : public A {
  public:
    __device__ void f1(int *a) { *a = 0;
#ifndef SKIP
      f2(a);   // nested virtual call; removed when compiled with -DSKIP
#endif
      }
    __device__ void f2(int *a) { }
};

__global__ void k(int *a){
  C c;
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  c.f1(a+idx);
}
const int ds = 1048576*32;
int main(){

  int *a;
  cudaMalloc(&a, ds*sizeof(int));
  cudaMemset(a, 0, ds*sizeof(int));
  k<<<ds/256, 256>>>(a);
  cudaDeviceSynchronize();
}

$ nvcc -o t1857 t1857.cu
$ nvprof ./t1857
==14482== NVPROF is profiling process 14482, command: ./t1857
==14482== Profiling application: ./t1857
==14482== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   59.94%  232.49us         1  232.49us  232.49us  232.49us  k(int*)
                   40.06%  155.36us         1  155.36us  155.36us  155.36us  [CUDA memset]
      API calls:   97.62%  335.66ms         1  335.66ms  335.66ms  335.66ms  cudaMalloc
                    1.45%  4.9783ms         4  1.2446ms  587.30us  3.1994ms  cuDeviceTotalMem
                    0.70%  2.4169ms       404  5.9820us     350ns  265.42us  cuDeviceGetAttribute
                    0.11%  368.67us         1  368.67us  368.67us  368.67us  cudaDeviceSynchronize
                    0.08%  259.96us         4  64.989us  56.940us  82.617us  cuDeviceGetName
                    0.02%  69.254us         1  69.254us  69.254us  69.254us  cudaMemset
                    0.01%  35.928us         8  4.4910us     405ns  30.633us  cuDeviceGet
                    0.01%  29.075us         1  29.075us  29.075us  29.075us  cudaLaunchKernel
                    0.01%  21.523us         4  5.3800us  2.9120us  10.645us  cuDeviceGetPCIBusId
                    0.00%  5.2010us         3  1.7330us     472ns  3.5170us  cuDeviceGetCount
                    0.00%  3.2030us         4     800ns     597ns  1.1300us  cuDeviceGetUuid
$ nvcc -o t1857 t1857.cu -DSKIP
$ nvprof ./t1857
==14526== NVPROF is profiling process 14526, command: ./t1857
==14526== Profiling application: ./t1857
==14526== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   59.96%  233.38us         1  233.38us  233.38us  233.38us  k(int*)
                   40.04%  155.88us         1  155.88us  155.88us  155.88us  [CUDA memset]
      API calls:   97.54%  330.94ms         1  330.94ms  330.94ms  330.94ms  cudaMalloc
                    1.48%  5.0341ms         4  1.2585ms  590.66us  3.2433ms  cuDeviceTotalMem
                    0.72%  2.4413ms       404  6.0420us     348ns  278.22us  cuDeviceGetAttribute
                    0.11%  380.98us         1  380.98us  380.98us  380.98us  cudaDeviceSynchronize
                    0.11%  375.39us         4  93.848us  59.047us  184.09us  cuDeviceGetName
                    0.02%  68.459us         1  68.459us  68.459us  68.459us  cudaMemset
                    0.01%  27.583us         1  27.583us  27.583us  27.583us  cudaLaunchKernel
                    0.01%  18.184us         4  4.5460us  2.9520us  7.1860us  cuDeviceGetPCIBusId
                    0.00%  6.2750us         8     784ns     390ns  1.5550us  cuDeviceGet
                    0.00%  3.2100us         4     802ns     597ns  1.0970us  cuDeviceGetUuid
                    0.00%  3.0780us         3  1.0260us     524ns  1.6140us  cuDeviceGetCount
$

CUDA 11.2, CentOS 7, V100

I would say that, in general, performance analysis by commenting things out is fraught with peril/confusion because of the nature of the optimizing compiler. However, I don’t see how that would apply based on your description, and I really can’t explain your observation.
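As a made-up illustration of why comment-it-out timing can mislead (this kernel is not from the question, just a sketch): if the line you remove is the only thing keeping a result live, the compiler may discard far more than that one line, and the apparent saving then has little to do with the cost of the removed call.

// Hypothetical example: removing one line lets the compiler treat the
// whole computation as dead code, so the measured "saving" is not the
// cost of that line.
__global__ void expensive(float *out, const float *in, int n){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n) {
    float v = in[idx];
    for (int i = 0; i < 1000; i++)      // costly loop
      v = sinf(v) + cosf(v);
#ifndef SKIP_STORE
    out[idx] = v;   // comment this out and v becomes dead, so the
                    // loop above can be optimized away entirely
#endif
  }
}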

(Note, adding -rdc=true to the compile command lines above made no difference in the observation for that test case.)
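For reference, the relocatable-device-code builds would simply be the same commands with the extra flag, e.g.:

$ nvcc -o t1857 t1857.cu -rdc=true
$ nvcc -o t1857 t1857.cu -rdc=true -DSKIP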