cudaLaunchHostFunc API example

I added a cudaLaunchHostFunc call after every CUDA kernel to measure each kernel's running time. In nsys, cudaLaunchHostFunc appears to affect the time interval between two adjacent CUDA kernel launches. The following two pictures show the same program without and with cudaLaunchHostFunc. Judging from the sparser SM activity and the total run time, the program takes longer with cudaLaunchHostFunc than without it.
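For reference, here is a minimal sketch of the pattern I am describing, not my actual program. The kernel `dummy_kernel`, the callback `record_time`, the problem size, and the kernel count are all placeholders made up for illustration. Each callback stores a host timestamp, and the difference between two consecutive timestamps is taken as the time of the kernel in between.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Placeholder kernel standing in for the real workload (hypothetical).
__global__ void dummy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Host callback (hypothetical name): stores the host time at which the
// stream reached this point. It must not call any CUDA API.
void CUDART_CB record_time(void *user_data) {
    *static_cast<Clock::time_point *>(user_data) = Clock::now();
}

int main() {
    const int n = 1 << 20;      // assumed problem size
    const int num_kernels = 4;  // assumed number of kernels

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One timestamp before the first kernel, then one after every kernel.
    Clock::time_point stamps[num_kernels + 1];

    cudaLaunchHostFunc(stream, record_time, &stamps[0]);
    for (int k = 0; k < num_kernels; ++k) {
        dummy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        cudaLaunchHostFunc(stream, record_time, &stamps[k + 1]);
    }

    // Wait until all kernels and callbacks in the stream have finished.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int k = 0; k < num_kernels; ++k) {
        double us = std::chrono::duration<double, std::micro>(
                        stamps[k + 1] - stamps[k]).count();
        std::printf("kernel %d: %.1f us (callback to callback)\n", k, us);
    }

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```

Because a host function enqueued with cudaLaunchHostFunc runs only after the preceding work in the stream has completed, and later work in that stream does not start until the function returns, each measured interval includes the callback and scheduling latency in addition to the kernel itself. I suspect that is related to the gaps I see between kernels in nsys.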