I already found this topic but didn’t want to bump it.
I have a very simple kernel and measured its performance with the CUDA Profiler. It takes about 11 usec of GPU time but ~1400 usec of CPU time. Is there any way I can decrease the CPU time? Why is it so much higher than the GPU time?
If I can’t lower the time needed by the CPU, does that mean my problem is just “too easy” or “too small” to benefit from CUDA?
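To narrow this down, one option is to time a warm-up launch separately from a later launch. This relies on the assumption that the first CUDA call in a process also pays one-time initialization costs. The sketch below is only an illustration: `scaleKernel`, the array size, and the launch configuration are hypothetical placeholders, not my actual kernel. It measures GPU time with CUDA events and host-side wall-clock time with `std::chrono` around a second launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "very simple kernel"; not the original code.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 16;          // assumed problem size
    const int threads = 256;        // assumed block size
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch: absorbs one-time costs so they are not counted below.
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a second launch: events give GPU-side time, chrono gives host-side time.
    auto cpuStart = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait until the kernel and stop event finish
    auto cpuStop = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);   // result is in milliseconds
    double cpuUs =
        std::chrono::duration<double, std::micro>(cpuStop - cpuStart).count();

    std::printf("GPU time (events): %.1f usec\n", gpuMs * 1000.0f);
    std::printf("CPU wall time:     %.1f usec\n", cpuUs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the CPU-side number for the second launch is far below ~1400 usec, the profiler figure probably includes one-time setup rather than per-launch cost. If it stays high, the per-launch overhead really does dominate a kernel this small.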