Is concurrent kernel execution really possible on the TK1?

Considering that the TK1 has a single SM, is it really possible to run kernels in different streams concurrently? I have been unable to do so, even with the latest versions of the CUDA libraries.

So is it really possible? Any sample code would be great. The sample code under cuBLAS also runs sequentially, as shown in the Visual Profiler.
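
For anyone who wants to reproduce this, a minimal sketch of such a test follows. The kernel name `spin`, the stream count, and the cycle count are illustrative assumptions, not taken from any NVIDIA sample. It prints what the runtime reports in `cudaDeviceProp::concurrentKernels`, then launches one small block per stream so that no single launch should fill the SM. Whether the launches actually overlap has to be checked in the profiler timeline; the code itself does not prove it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: spins for roughly `cycles` clock ticks so
// that each launch is long enough to be visible in the profiler timeline.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    // concurrentKernels is 1 if the device reports support for running
    // several kernels at the same time.
    printf("%s: %d SM(s), concurrentKernels = %d\n",
           prop.name, prop.multiProcessorCount, prop.concurrentKernels);

    const int kNumStreams = 4;  // assumed value, chosen for illustration
    cudaStream_t streams[kNumStreams];
    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // One block of 32 threads per launch, each in its own non-default stream.
    for (int i = 0; i < kNumStreams; ++i)
        spin<<<1, 32, 0, streams[i]>>>(100000000LL);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamDestroy(streams[i]);

    return err == cudaSuccess ? 0 : 1;
}
```

Built with `nvcc` and run under the Visual Profiler, the four `spin` launches should appear either side by side or back to back, which is the behaviour I am asking about.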

Also, some insight into what streams are good for on a single SM would be appreciated.