On the GPU, if one virtual function with the __device__ qualifier calls another virtual function, is the overhead very high? In my code, when I commented out a call to a virtual function with an empty body, performance improved by about 30% (78 ms dropped to 53 ms). Does this mean the virtual function call itself took 25 ms?
I don’t see any evidence of that:
$ cat t1857.cu
class A{
public:
    virtual __device__ void f1(int *){};
    virtual __device__ void f2(int *){};
};
class C : public A {
public:
    __device__ void f1(int *a) {
        *a = 0;
#ifndef SKIP
        f2(a);
#endif
    }
    __device__ void f2(int *a) { }
};
__global__ void k(int *a){
    C c;
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    c.f1(a+idx);
}
const int ds = 1048576*32;
int main(){
    int *a;
    cudaMalloc(&a, ds*sizeof(int));
    cudaMemset(a, 0, ds*sizeof(int));
    k<<<ds/256, 256>>>(a);
    cudaDeviceSynchronize();
}
$ nvcc -o t1857 t1857.cu
$ nvprof ./t1857
==14482== NVPROF is profiling process 14482, command: ./t1857
==14482== Profiling application: ./t1857
==14482== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 59.94% 232.49us 1 232.49us 232.49us 232.49us k(int*)
40.06% 155.36us 1 155.36us 155.36us 155.36us [CUDA memset]
API calls: 97.62% 335.66ms 1 335.66ms 335.66ms 335.66ms cudaMalloc
1.45% 4.9783ms 4 1.2446ms 587.30us 3.1994ms cuDeviceTotalMem
0.70% 2.4169ms 404 5.9820us 350ns 265.42us cuDeviceGetAttribute
0.11% 368.67us 1 368.67us 368.67us 368.67us cudaDeviceSynchronize
0.08% 259.96us 4 64.989us 56.940us 82.617us cuDeviceGetName
0.02% 69.254us 1 69.254us 69.254us 69.254us cudaMemset
0.01% 35.928us 8 4.4910us 405ns 30.633us cuDeviceGet
0.01% 29.075us 1 29.075us 29.075us 29.075us cudaLaunchKernel
0.01% 21.523us 4 5.3800us 2.9120us 10.645us cuDeviceGetPCIBusId
0.00% 5.2010us 3 1.7330us 472ns 3.5170us cuDeviceGetCount
0.00% 3.2030us 4 800ns 597ns 1.1300us cuDeviceGetUuid
$ nvcc -o t1857 t1857.cu -DSKIP
$ nvprof ./t1857
==14526== NVPROF is profiling process 14526, command: ./t1857
==14526== Profiling application: ./t1857
==14526== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 59.96% 233.38us 1 233.38us 233.38us 233.38us k(int*)
40.04% 155.88us 1 155.88us 155.88us 155.88us [CUDA memset]
API calls: 97.54% 330.94ms 1 330.94ms 330.94ms 330.94ms cudaMalloc
1.48% 5.0341ms 4 1.2585ms 590.66us 3.2433ms cuDeviceTotalMem
0.72% 2.4413ms 404 6.0420us 348ns 278.22us cuDeviceGetAttribute
0.11% 380.98us 1 380.98us 380.98us 380.98us cudaDeviceSynchronize
0.11% 375.39us 4 93.848us 59.047us 184.09us cuDeviceGetName
0.02% 68.459us 1 68.459us 68.459us 68.459us cudaMemset
0.01% 27.583us 1 27.583us 27.583us 27.583us cudaLaunchKernel
0.01% 18.184us 4 4.5460us 2.9520us 7.1860us cuDeviceGetPCIBusId
0.00% 6.2750us 8 784ns 390ns 1.5550us cuDeviceGet
0.00% 3.2100us 4 802ns 597ns 1.0970us cuDeviceGetUuid
0.00% 3.0780us 3 1.0260us 524ns 1.6140us cuDeviceGetCount
$
CUDA 11.2, CentOS 7, V100
I would say that, in general, I consider performance analysis by commenting things out to be fraught with peril and confusion, because of the nature of the optimizing compiler: removing one line can allow the compiler to eliminate other work that depended on it. However, I don't see how that would apply based on your description, and I really can't explain your observation.
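To illustrate the kind of thing I mean, here is a minimal sketch (a hypothetical kernel, not taken from your code): commenting out the final store makes the computed value unused, so the compiler is free to delete the entire computation feeding it, and the measured "speedup" then reflects all of that eliminated work rather than the cost of the single removed line.

__device__ float expensive(float x) {   // hypothetical helper doing artificial work
    for (int i = 0; i < 100; i++)
        x = sinf(x) * cosf(x) + 1.0f;
    return x;
}

__global__ void k2(float *out) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    float v = expensive(out[idx]);
    // If the next line is commented out, v is unused, so the compiler can
    // remove the entire expensive() computation as dead code. The kernel then
    // appears to speed up by far more than the cost of the one removed line.
    out[idx] = v;
}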
(Note: adding -rdc=true to the compile command lines above made no difference in the observation for that test case.)
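That is, repeating the two builds above as, for example:

$ nvcc -rdc=true -o t1857 t1857.cu
$ nvcc -rdc=true -o t1857 t1857.cu -DSKIP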