CUDA kernels run slower when two processes share the GPU

I am using the TensorFlow library in my application to run model inference. When I run a single instance, inference takes around X milliseconds. When I run a second binary with the same model concurrently, each of them takes longer. I want to find out whether the GPU is actually saturated, that is, whether all of its multiprocessors are in use, and whether that is why inference slows down. In the Visual Profiler, I see that the total compute time has increased and that each individual kernel takes longer to execute. What could cause the kernels to take longer, and how can I analyze this further?
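
As a starting point for checking whether the GPU is busy, here is a minimal sketch that samples utilization through nvidia-smi while both binaries are running. It assumes nvidia-smi is on the PATH; the helper names parse_utilization and sample_utilization are hypothetical, made up for this illustration. Note that utilization.gpu reports the percentage of time during which at least one kernel was executing, so it indicates whether the GPU is busy but not how many multiprocessors are occupied.

```python
import subprocess
import time

# Fields to query; both are standard nvidia-smi --query-gpu properties.
QUERY = "utilization.gpu,utilization.memory"


def _to_int(field):
    """Convert one CSV field to int, or None if nvidia-smi reports it as unavailable."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_utilization(csv_text):
    """Parse 'csv,noheader,nounits' output into one (gpu %, memory %) tuple per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        gpu, mem = line.split(",")
        rows.append((_to_int(gpu), _to_int(mem)))
    return rows


def sample_utilization():
    """Run nvidia-smi once and return the parsed utilization (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_utilization(result.stdout)


if __name__ == "__main__":
    # Sample twice per second for five seconds while the inference binaries run.
    for _ in range(10):
        print(sample_utilization())
        time.sleep(0.5)
```

Running this once with a single binary and once with both gives a rough comparison; per-kernel details would still have to come from the profiler.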