Multiple CUDA streams assigned mostly to SM 0

Hi All,

I just bought a new Asus 1080 Ti, which has 28 streaming multiprocessors (SMs).

I then wrote a program that creates 28 OpenMP threads, each of which issues kernel calls to its own stream (explicitly created, not the default stream 0).
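For readers who want to reproduce this, here is a minimal sketch of the kind of setup described above. It is not the actual program: the kernel `work`, the buffer size, and the launch configuration are placeholders chosen for illustration.

```cuda
// Sketch only: one explicitly created stream per OpenMP thread.
// Build (assumption): nvcc -Xcompiler /openmp streams.cu   (MSVC)
//                     nvcc -Xcompiler -fopenmp streams.cu  (GCC)
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real workload.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 28;   // one per SM on a 1080 Ti
    const int n = 1 << 16;     // elements per buffer (placeholder)

    cudaStream_t streams[kStreams];
    float *bufs[kStreams];
    for (int t = 0; t < kStreams; ++t) {
        cudaStreamCreate(&streams[t]);              // explicit, non-default stream
        cudaMalloc(&bufs[t], n * sizeof(float));
        cudaMemset(bufs[t], 0, n * sizeof(float));
    }

    // Each loop iteration launches into its own stream; OpenMP spreads the
    // iterations over up to 28 host threads.
    #pragma omp parallel for num_threads(kStreams)
    for (int t = 0; t < kStreams; ++t) {
        work<<<(n + 255) / 256, 256, 0, streams[t]>>>(bufs[t], n);
        cudaStreamSynchronize(streams[t]);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int t = 0; t < kStreams; ++t) {
        cudaFree(bufs[t]);
        cudaStreamDestroy(streams[t]);
    }
    return 0;
}
```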

In the Visual Profiler I observed severe serialization among the streams.

When I print out the streams in the host code, I see 28 different IDs (addresses), evenly distributed.

When I print out the smid in the kernel (following the Stack Overflow question "cuda - How can I find out which thread is getting executed on which core of the GPU?"), it shows that most kernels are executing on SM 0: 8163 out of 8192.
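For context, reading the SM ID in a kernel is usually done with a small piece of inline PTX that reads the `%smid` special register. The sketch below shows one way to do it; the helper name `get_smid`, the kernel `record_smid`, and the grid size are assumptions for illustration, not the code from my program.

```cuda
// Sketch only: record which SM each block runs on.
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special register %smid (ID of the SM running this thread).
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One sample per block: thread 0 stores the SM ID for its block.
__global__ void record_smid(unsigned int *out) {
    if (threadIdx.x == 0) out[blockIdx.x] = get_smid();
}

int main() {
    const int blocks = 64;   // placeholder grid size
    unsigned int h_out[blocks];
    unsigned int *d_out = nullptr;

    cudaMalloc(&d_out, blocks * sizeof(unsigned int));
    record_smid<<<blocks, 128>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_out[b]);

    cudaFree(d_out);
    return 0;
}
```

The value is sampled once per block here, so the count of distinct SM IDs depends on how many blocks each launch contains.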

My platform is Windows 7 64-bit, driver 398.11, CUDA toolkit 9.2.

Question: what could be going wrong? Thanks!