TCC mode performance

Hi
I have a Quadro P4000. When I run the same workload, WDDM mode shows better performance than TCC mode.
I checked the profiler: driver latency is greatly reduced in TCC mode, but the same kernels take longer to execute in TCC mode.

I checked the SM and memory clocks, and they seem to be at maximum in both modes.
What could be causing the slowdown in TCC mode?
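A small self-contained program along these lines could make the comparison repeatable with one binary in both modes. It prints whether device 0 is currently running under TCC (via the `tccDriver` field of `cudaDeviceProp`) and times back-to-back launches of a kernel with CUDA events. This is a sketch under stated assumptions: the `saxpy` kernel, the problem size, and the iteration count are hypothetical placeholders, not the workload from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute a kernel from the real workload.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_)); return 1; } } while (0)

int main() {
    // Report which driver model the device is using (tccDriver is 1 under TCC).
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("%s, driver model: %s\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");

    const int n = 1 << 24;          // placeholder problem size
    const int iters = 100;          // placeholder launch count
    float *x = nullptr, *y = nullptr;
    CHECK(cudaMalloc(&x, n * sizeof(float)));
    CHECK(cudaMalloc(&y, n * sizeof(float)));
    CHECK(cudaMemset(x, 0, n * sizeof(float)));
    CHECK(cudaMemset(y, 0, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not included in the timing.
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Bracket the launch loop with events; elapsed time is measured on the GPU.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("avg time per launch: %.4f ms\n", ms / iters);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(x));
    CHECK(cudaFree(y));
    return 0;
}
```

Because the events bracket the whole loop, this number also includes any idle gaps between launches. Comparing it against the profiler's per-kernel durations in each mode should help separate launch overhead from kernel execution time.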

Thanks.

Hi Liran,
Could you file a separate bug from “https://developer.nvidia.com/->my account->my bug” so we can track the issue easily and update you on its status quickly?
Could you also provide a self-contained reproducer and the steps to reproduce the issue?
You can send attachments to CUDAIssues@nvidia.com for file exchange.
Thanks for your cooperation.