When I measure the execution time of the individual parts of a function that calls a kernel (cudaMalloc, cudaMemcpy from CPU to GPU, the kernel itself, cudaDeviceSynchronize, and cudaMemcpy back from GPU to CPU), I find that all parts except the synchronization take e.g. 0.002042 sec, while the synchronization itself takes 0.108922 sec. I therefore assume that the execution time is dominated by cudaDeviceSynchronize. Why is that? Is there a way to minimize the synchronization time? I tried different values for the kernel grid size and block size, and found that when the synchronization time decreases, the measured kernel time increases. So I cannot find a way to decrease the total duration.
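For illustration, here is a minimal sketch of the kind of per-phase measurement described above. It is not the actual code: the `scale` kernel, the `time_phases` helper, the `PhaseTimes` struct, the array size, and the block size are all hypothetical placeholders, and it assumes the timings are taken with host-side `std::chrono` timers around each call.

```cuda
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(e));      \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel: doubles every element.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

using Clock = std::chrono::steady_clock;

// Seconds elapsed on the host wall clock since t0.
static double since(Clock::time_point t0) {
    return std::chrono::duration<double>(Clock::now() - t0).count();
}

// Host-side time spent in each call, in seconds.
struct PhaseTimes { double alloc, h2d, launch, sync, d2h; };

PhaseTimes time_phases(std::vector<float> &h, int block) {
    const int n = static_cast<int>(h.size());
    const size_t bytes = h.size() * sizeof(float);
    const int grid = (n + block - 1) / block;
    float *d = nullptr;
    PhaseTimes t{};

    auto t0 = Clock::now();
    CHECK(cudaMalloc(&d, bytes));
    t.alloc = since(t0);

    t0 = Clock::now();
    CHECK(cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice));
    t.h2d = since(t0);

    // Time spent in the launch call only; the wait is measured under "sync".
    t0 = Clock::now();
    scale<<<grid, block>>>(d, n);
    CHECK(cudaGetLastError());
    t.launch = since(t0);

    t0 = Clock::now();
    CHECK(cudaDeviceSynchronize());
    t.sync = since(t0);

    t0 = Clock::now();
    CHECK(cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost));
    t.d2h = since(t0);

    CHECK(cudaFree(d));
    return t;
}

int main() {
    std::vector<float> h(1 << 20, 1.0f);  // placeholder problem size
    PhaseTimes t = time_phases(h, 256);   // placeholder block size
    std::printf("malloc %.6f  H2D %.6f  launch %.6f  sync %.6f  D2H %.6f\n",
                t.alloc, t.h2d, t.launch, t.sync, t.d2h);
    return 0;
}
```

Calling `time_phases` with different block sizes reproduces the experiment of varying the launch configuration; the grid size is derived from the block size here so that every element is still covered.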