Why is the cublasSdot() function slow in a certain section?

Environment:

OS: Windows 10 (x64)
GPU: GTX 1080 Ti
Tool: Visual Studio 2015
CUDA version: 9.1
RAM: 16 GB
CPU: i7-4790

I am using CUDA in my project, together with NVIDIA’s PCG algorithm. After various processing steps run on the GPU, the program enters the PCG algorithm. The first place PCG computes the value of r1 is just before the while loop starts, and it uses cublasSdot() there. That call is very slow. I cannot see how cublasSdot() is implemented, but I suspect it calls cudaMemcpy() internally. The processing that runs before PCG does far more computation, yet it appears to take very little time. I do not understand why the program becomes so slow at the cublasSdot() call.

Please help me.

There is a comment on your cross-posting which I believe provides one possible explanation:

https://stackoverflow.com/questions/51112035/why-is-the-cublassdot-function-slow-in-a-certain-section

Without example code, I doubt anyone could explain your observation definitively.
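
In the meantime, here is a minimal sketch of one common cause of this pattern. Whether it applies to your program is an assumption on my part. Kernel launches return to the host immediately, while cublasSdot() in the default host pointer mode blocks until the result is on the host. If you time with a host clock, the dot product can end up charged for kernels that were still running. The kernel busyKernel, the helper msSince, and the sizes below are hypothetical stand-ins for your earlier processing, not code from the NVIDIA sample.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical stand-in for the GPU work that runs before PCG.
__global__ void busyKernel(float *x, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < iters; ++k)
            v = v * 0.999f + 0.001f;
        x[i] = v;
    }
}

// Milliseconds elapsed on the host since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    // Error checking is omitted to keep the sketch short.
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_r = nullptr;
    cudaMalloc(&d_r, n * sizeof(float));
    cudaMemset(d_r, 0, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    float r1 = 0.0f;
    // Warm-up call so one-time library setup is not included in the timings.
    cublasSdot(handle, n, d_r, 1, d_r, 1, &r1);

    // Case 1: no synchronization. The launch returns right away, and the
    // blocking cublasSdot() call absorbs the wait for the kernel.
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(d_r, n, 20000);
    double launchMs = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cublasSdot(handle, n, d_r, 1, d_r, 1, &r1);
    double dotUnsyncedMs = msSince(t0);

    // Case 2: synchronize first, so the kernel time is measured on its own.
    busyKernel<<<blocks, threads>>>(d_r, n, 20000);
    t0 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();
    double kernelMs = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cublasSdot(handle, n, d_r, 1, d_r, 1, &r1);
    double dotSyncedMs = msSince(t0);

    printf("kernel launch only:           %.3f ms\n", launchMs);
    printf("cublasSdot without sync:      %.3f ms\n", dotUnsyncedMs);
    printf("kernel wait after sync:       %.3f ms\n", kernelMs);
    printf("cublasSdot after sync:        %.3f ms\n", dotSyncedMs);

    cublasDestroy(handle);
    cudaFree(d_r);
    return 0;
}
```

If the same thing is happening in your program, adding cudaDeviceSynchronize() just before the cublasSdot() call should move most of the measured time out of the dot product and into the synchronize call. Another thing worth ruling out is one-time cuBLAS initialization, for example cublasCreate() or the first cuBLAS call in the process, landing inside the timed region.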