CPU usage while waiting for kernel

It is “high” because cudaThreadSynchronize() is effectively a spin lock: it polls the GPU at a rather high frequency until the kernel has finished. The CPU thread is just sitting in that polling loop, so it isn’t doing any useful work, even though the OS reports the core as busy. Since CUDA 2.3, I understand you can control how the host thread waits (spin, yield, or block) through cudaSetDeviceFlags() if it really bothers you.
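If you want to try that, here is a minimal sketch of what I mean, not a definitive recipe. It assumes a reasonably recent toolkit, where the flag is spelled cudaDeviceScheduleBlockingSync and cudaDeviceSynchronize() has replaced the deprecated cudaThreadSynchronize(). The kernel busy_kernel and the helper run_and_wait_blocking are made-up names for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: burns some GPU time so the host has something to wait for.
__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 100000; ++k)
            x = x * 1.0000001f + 1.0e-7f;
        data[i] = x;
    }
}

// Hypothetical helper: launch the kernel and wait for it without spinning.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t run_and_wait_blocking(int n)
{
    // Ask the runtime to put the host thread to sleep while it waits,
    // instead of polling. Do this before any other runtime call that
    // touches the device; older toolkits reject the call once the
    // context already exists.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) return err;

    float *d_data = NULL;
    err = cudaMalloc((void **)&d_data, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemset(d_data, 0, n * sizeof(float));
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        busy_kernel<<<blocks, threads>>>(d_data, n);
        err = cudaGetLastError();          // catches launch errors
    }
    if (err == cudaSuccess) {
        // With the blocking-sync flag set, the host thread should sleep
        // here rather than spin, so CPU usage should drop.
        err = cudaDeviceSynchronize();
    }

    cudaFree(d_data);
    return err;
}
```

Call run_and_wait_blocking(1 << 20) from your main() and watch the process in top or Task Manager while the kernel runs. Swapping the flag for cudaDeviceScheduleSpin should bring back the busy core, and cudaDeviceScheduleYield sits in between. The trade-off is latency: a sleeping thread takes a little longer to notice that the kernel has finished.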