Profiling a CUDA application with TBB

Hi everyone!

I have several questions about my GPU program, all related to GPU multitasking.

Note that I have a GeForce GTX Titan card, which has 14 SMX units.

First, a brief description of my program:
It runs 14 TBB threads on the CPU, each with its own CUDA stream.
Within its stream, each TBB thread launches small CUDA kernels one after another, each with 1 GPU block of 1024 GPU threads.
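A minimal sketch of this setup may make the description more concrete. It is only an illustration under stated assumptions: plain `std::thread` workers stand in for the TBB threads, and `small_kernel`, `worker`, and `run_workers` are hypothetical names, not the real code.

```cuda
#include <cassert>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one small kernel: some arithmetic per thread.
__global__ void small_kernel(float* data, int iters) {
    int i = threadIdx.x;              // single block, so threadIdx.x is the index
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

// One host worker: creates its own stream and launches kernels into it
// one after another. Kernels within one stream run in order; kernels in
// different streams are allowed (but not guaranteed) to overlap.
static void worker(int launches, cudaError_t* status) {
    const int kThreads = 1024;
    cudaStream_t stream = nullptr;
    float* buf = nullptr;
    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc((void**)&buf, kThreads * sizeof(float));
    if (err == cudaSuccess) err = cudaMemsetAsync(buf, 0, kThreads * sizeof(float), stream);
    for (int n = 0; err == cudaSuccess && n < launches; ++n) {
        small_kernel<<<1, kThreads, 0, stream>>>(buf, 1000);
        err = cudaGetLastError();     // catches launch-configuration errors
    }
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    if (buf) cudaFree(buf);
    if (stream) cudaStreamDestroy(stream);
    *status = err;
}

// Starts n_workers host threads (assumption: std::thread instead of TBB)
// and returns the first CUDA error seen, or cudaSuccess.
cudaError_t run_workers(int n_workers, int launches) {
    assert(n_workers > 0 && launches >= 0);
    std::vector<cudaError_t> status(n_workers, cudaSuccess);
    std::vector<std::thread> threads;
    for (int t = 0; t < n_workers; ++t)
        threads.emplace_back(worker, launches, &status[t]);
    for (auto& th : threads) th.join();
    for (cudaError_t e : status)
        if (e != cudaSuccess) return e;
    return cudaSuccess;
}
```

Calling `run_workers(14, 100)` from `main` would correspond to the configuration described above: 14 host threads, 14 streams, one block of 1024 threads per kernel.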
When I profile my application, the timeline of these kernels looks like one of two pictures:
The first (all kernels run in parallel): [screenshot not preserved]
The second: [screenshot not preserved]

I'm pretty sure that the way these kernels are assigned to different SMX units is undefined and depends on the driver, but something else interests me more:
when I profile my application, I almost always get the first picture and only very rarely the second.
But when I run my program without profiling, I almost always get the second picture and never the first.
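One rough way to check for overlap without the profiler is to compare wall-clock times. The sketch below is an assumption-laden illustration, not the real program: `busy_kernel` and `time_streams` are hypothetical names, and for simplicity a single host thread issues the launches round-robin across the streams instead of using one host thread per stream.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel that keeps one block of 1024 threads busy for a while.
__global__ void busy_kernel(float* data, int iters) {
    int i = threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

// Launches `launches` kernels into each of `n_streams` streams and returns
// the elapsed wall-clock time in milliseconds, or -1.0 on any CUDA error.
double time_streams(int n_streams, int launches) {
    const int kThreads = 1024;
    std::vector<cudaStream_t> streams(n_streams, nullptr);
    std::vector<float*> bufs(n_streams, nullptr);
    cudaError_t err = cudaSuccess;
    for (int s = 0; s < n_streams && err == cudaSuccess; ++s) {
        err = cudaStreamCreate(&streams[s]);
        if (err == cudaSuccess) err = cudaMalloc((void**)&bufs[s], kThreads * sizeof(float));
        if (err == cudaSuccess) err = cudaMemset(bufs[s], 0, kThreads * sizeof(float));
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int n = 0; n < launches && err == cudaSuccess; ++n)
        for (int s = 0; s < n_streams; ++s)
            busy_kernel<<<1, kThreads, 0, streams[s]>>>(bufs[s], 20000);
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait for all streams
    auto t1 = std::chrono::steady_clock::now();

    for (int s = 0; s < n_streams; ++s) {
        if (bufs[s]) cudaFree(bufs[s]);
        if (streams[s]) cudaStreamDestroy(streams[s]);
    }
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

If `time_streams(14, K)` comes out close to `time_streams(1, K)`, the kernels in different streams most likely overlapped; if it is closer to 14 times larger, they most likely ran one after another. The numbers include launch overhead, so it helps to make one warm-up call first and to treat the result as an indication only.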
So what I want to know is: can I somehow control this and get the first picture every time, with all kernels running fully in parallel?

Thank you for your help!