cuBLAS causes 7% performance drop in subsequent kernels

GPU trace without cuBLAS calls.

GPU trace with a single cuBLAS call. Performance of default_function_kernel0 drops by roughly 7%.

I have been doing some performance benchmarking on an RTX 2070, and the GPU traces above show that a single cuBLAS call leads to a roughly 7% performance degradation in subsequent kernels. I repeated the experiment multiple times, and the degradation appears every time. Could someone please give me some hints about what the root cause of the performance drop might be? Thanks.