The only information you have provided is that you are executing a series of kernels on a K20c on Linux. The typical CPU-side launch overhead is 4-8 µs, so a value of 600 µs is significant and well outside the normal range.
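One way to establish a baseline on your own system is a micro-benchmark that times only the asynchronous launch call for an empty kernel. This is a sketch, not from your code; the grid size, iteration count, and kernel name are arbitrary:

```cuda
#include <cstdio>
#include <chrono>

__global__ void nullKernel() {}

int main() {
    // Warm up: the first launch pays one-time context/module init costs.
    nullKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int N = 1000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        nullKernel<<<1, 1>>>();   // asynchronous: times only the CPU launch path
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;
    printf("average launch overhead: %.2f us\n", us);
    return 0;
}
```

If this prints something in the 4-8 µs range but your real kernel's launch still costs ~600 µs, the overhead is specific to that kernel's launch state rather than to the machine.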
If this delay occurs on every launch of sortCoordArr*, determine what explicit launch state changes between launches and whether reducing those state changes reduces the overhead. State changes include stack size, printf FIFO size, texture bindings, etc.
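As an illustration of hoisting such state changes out of the launch loop (the kernel signature, limit values, and launch configuration below are placeholders, not taken from your code):

```cuda
#include <cstddef>

__global__ void sortCoordArr();  // stand-in for the real kernel

void runSorted(int iterations) {
    // Per-context state: set once at startup, not before every launch.
    // Changing these between launches forces the driver to rework state
    // and can inflate the per-launch cost.
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);       // bytes per thread
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);   // 1 MiB FIFO

    for (int i = 0; i < iterations; ++i) {
        // No cudaDeviceSetLimit / texture rebinding inside the loop.
        sortCoordArr<<<256, 256>>>();
    }
    cudaDeviceSynchronize();
}
```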
If this delay occurs only on the first launch of sortCoordArr* and can be reproduced on every run of the process, then investigate what lazy initialization or updates the driver may be performing by looking at which features or additional resources this kernel uses. Features with lazy initialization include heap allocation (device malloc/free/new/delete), printf FIFO allocation (printf), and CUDA dynamic parallelism.
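If that turns out to be the cause, one mitigation is to trigger the lazy allocations during startup so the steady-state launches stay fast. A sketch, with placeholder limit sizes and a hypothetical warm-up kernel:

```cuda
#include <cstdio>

__global__ void warmup() {
    // Touch the features whose backing resources are lazily allocated:
    printf("");            // forces creation of the device printf FIFO
    void *p = malloc(16);  // forces creation of the device heap
    free(p);
}

void initOnce() {
    // Size the lazily allocated resources explicitly, once, up front.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 << 20);   // 8 MiB heap
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);   // 1 MiB FIFO

    warmup<<<1, 1>>>();
    cudaDeviceSynchronize();   // the first-launch cost is paid here
}
```

With this in place, the expensive first launch moves into initOnce() and the first timed launch of sortCoordArr* should look like every other launch.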
If this occurs sporadically, the OS thread scheduler may simply have context-switched out the launching thread. A system profiler can confirm this: xperf on Windows, or perf sched on Linux, which applies in your case.
The CUDA profilers introduce additional overhead on each API call and kernel launch: 0.5-1 µs for most API calls, 5-10 µs for concurrent kernel launches, and a variable amount for resource initialization. CUDA 5.5 reduces this overhead. I do not think your 600 µs is due to the profiler; when the profiler does introduce high overhead, it reports it in the Profiling Overhead row of the timeline.