Considering that the TK1 has a single SM, is it really possible to run streams concurrently? I have been unable to do so, even with the latest versions of the CUDA libraries.
So is it really possible? Any sample code would be great. The cuBLAS sample code also runs sequentially, as shown in the Visual Profiler.
Also, could someone give better insight into what “streams” are good for on a single SM?