cudaDeviceSynchronize is very slow

I need to use the result of a CUDA kernel in the CPU host code that follows it,
so I put a cudaDeviceSynchronize call right after the kernel launch.
That call is very slow, so the time gained by using the kernel is lost:
the kernel saves just under 100 ms, but cudaDeviceSynchronize takes 150 ms.
It makes me wonder whether CUDA programming is worth it here.
Could someone clarify? Thanks in advance.

Are you using managed memory? I’ve found that when I use managed memory (for example, memory allocated with cudaMallocManaged), cudaDeviceSynchronize forces the entire contents of the allocation to be copied over.

So if your buffer is larger than your data actually needs, try making it as small as possible. I’m not sure whether 150 ms is normal, but with enough managed memory allocated I can see it being possible.
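
It may also be worth checking where the 150 ms actually goes. A kernel launch returns to the host before the kernel has finished, so the time spent waiting in cudaDeviceSynchronize normally includes the kernel's own run time. Below is a minimal sketch that separates the two using CUDA events. The kernel `scale`, the helper `run_timing_demo`, and the buffer size are hypothetical and only for illustration; they are not from the original post.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns 0 on success, non-zero on any CUDA error or wrong result.
int run_timing_demo() {
    const int n = 1 << 20;  // assumed size: allocate only what the kernel needs
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(data, n);  // asynchronous: returns immediately
    cudaEventRecord(stop);
    auto t1 = clock::now();

    cudaError_t err = cudaDeviceSynchronize();  // host blocks until the kernel is done
    auto t2 = clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);  // GPU-side time between the two events

    double launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double sync_ms = std::chrono::duration<double, std::milli>(t2 - t1).count();
    printf("host time in launch call:     %.3f ms\n", launch_ms);
    printf("host time in synchronize:     %.3f ms\n", sync_ms);
    printf("GPU time between the events:  %.3f ms\n", gpu_ms);

    // The result is safe to read on the host only after the synchronize.
    int ok = (err == cudaSuccess) && (data[0] == 2.0f) && (data[n - 1] == 2.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return ok ? 0 : 1;
}

int main() {
    return run_timing_demo();
}
```

If the GPU time between the events is close to the time spent in cudaDeviceSynchronize, the wait is mostly the kernel itself rather than overhead from the synchronize call. With managed memory, that GPU time can also include moving the data to the device, which is one reason a smaller allocation may help.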