Profiling a CUDA application with TBB

Hi everyone!

I have several questions about my GPU program, which are related to GPU multitasking.

It is important to note that I have a GeForce GTX Titan card, which has 14 SMX units.

First, a brief description of my program:
It has 14 TBB threads on the CPU, and each one has its own CUDA stream.
Each TBB thread launches small CUDA kernels one after another in its own stream, each kernel with 1 GPU block and 1024 GPU threads.
When I profile my application, I see one of two different pictures of how these kernels behave:
the first: https://www.dropbox.com/s/0eg9q36spzmxuj6/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202014-10-10%2019.05.48.png?dl=0
the second: https://www.dropbox.com/s/sgohchhcdnkuseo/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202014-10-10%2019.11.22.png?dl=0
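
A minimal sketch of the launch pattern described above (this is not the actual application code). To stay self-contained it assumes std::thread in place of TBB threads, and the names small_kernel, worker, run_streams and the per-thread buffer are hypothetical, chosen only for illustration:

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

constexpr int kThreadsPerBlock = 1024;  // 1024 GPU threads per kernel, as in the post

// Hypothetical stand-in for one of the small kernels.
__global__ void small_kernel(float* data) {
    int i = threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

// One host thread: creates its own stream and launches kernels into it
// one after another, each with 1 block of 1024 threads.
void worker(int launches, cudaError_t* status) {
    cudaStream_t stream = nullptr;
    float* buf = nullptr;
    size_t bytes = kThreadsPerBlock * sizeof(float);

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc(&buf, bytes);
    if (err == cudaSuccess) err = cudaMemsetAsync(buf, 0, bytes, stream);

    for (int i = 0; i < launches && err == cudaSuccess; ++i) {
        small_kernel<<<1, kThreadsPerBlock, 0, stream>>>(buf);
        err = cudaGetLastError();  // catches launch errors only
    }
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    cudaFree(buf);
    if (stream) cudaStreamDestroy(stream);
    *status = err;
}

// Starts num_threads host threads (14 in the post, one per SMX) and
// returns how many of them finished without a CUDA error.
int run_streams(int num_threads, int launches_per_thread) {
    std::vector<cudaError_t> status(num_threads, cudaSuccess);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, launches_per_thread, &status[t]);
    for (auto& th : threads) th.join();

    int ok = 0;
    for (cudaError_t s : status)
        if (s == cudaSuccess) ++ok;
    return ok;
}
```

Calling run_streams(14, 100) from main reproduces the setup of 14 host threads with one stream each. Whether the kernels from different streams actually overlap on the GPU is exactly what the two profiler pictures differ in; the sketch itself does not guarantee either outcome.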

I'm pretty sure that the algorithm by which these kernels are placed on different SMX units is undefined and depends on the driver, but one thing is more interesting to me:
when I profile my application, I get the first picture in almost all cases and the second one only very rarely.
But when I launch my program without profiling, I get the second picture in almost all cases and never the first.
So what I want to know is: can I somehow control this and get the first picture in all cases, where all kernels run fully in parallel?

Thank you for your help!