Multiple CUDA streams assigned mostly to SM 0

Hi All,

I just bought a new Asus GTX 1080 Ti, which has 28 streaming multiprocessors (SMs).

I then wrote a program that creates 28 OpenMP threads, each of which issues kernel launches to its own explicitly created stream (not the default stream 0).
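To make the launch pattern concrete, here is a minimal sketch of it. It is not the actual program: the kernel `work`, the buffer size, the block size, and the `CHECK` macro are all placeholders chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not from the original program).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real work.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int kNumStreams = 28;   // one stream per OpenMP thread
    const int n = 1 << 16;        // assumed problem size per stream
    cudaStream_t streams[kNumStreams];
    float *buffers[kNumStreams];

    // Create the streams explicitly; none of them is the default stream.
    for (int s = 0; s < kNumStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc(&buffers[s], n * sizeof(float)));
        CHECK(cudaMemset(buffers[s], 0, n * sizeof(float)));
    }

    // Each OpenMP thread launches into the stream matching its thread id.
    #pragma omp parallel num_threads(kNumStreams)
    {
        int t = omp_get_thread_num();
        work<<<(n + 255) / 256, 256, 0, streams[t]>>>(buffers[t], n);
        CHECK(cudaGetLastError());
        CHECK(cudaStreamSynchronize(streams[t]));
    }

    for (int s = 0; s < kNumStreams; ++s) {
        CHECK(cudaFree(buffers[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return 0;
}
```

This needs OpenMP enabled in the host compiler, for example `nvcc -Xcompiler /openmp` with MSVC or `nvcc -Xcompiler -fopenmp` with GCC.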

In the Visual Profiler, I observe severe serialization among the streams.

When I print the stream handles in the host code, I get 28 different IDs (addresses), evenly distributed.

When I print the SM ID (smid) inside the kernel, using the approach from https://stackoverflow.com/questions/28881491/how-can-i-find-out-which-thread-is-getting-executed-on-which-core-of-the-gpu, most kernels turn out to be executing on SM 0: 8163 out of 8192.
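For reference, a minimal sketch of this kind of SM ID check is shown below. It reads the PTX special register `%smid` through inline assembly and builds a per-SM histogram on the host. The names `get_smid` and `record_smid`, and the block and thread counts, are assumptions for illustration and are not taken from the actual program.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid: the ID of the SM running this thread.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block records which SM the block was scheduled on.
__global__ void record_smid(unsigned int *block_smid)
{
    if (threadIdx.x == 0) block_smid[blockIdx.x] = get_smid();
}

int main()
{
    const int numBlocks = 1024;   // assumed grid size for the experiment
    const int numThreads = 128;   // assumed block size

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    unsigned int *d_smid = nullptr;
    cudaMalloc(&d_smid, numBlocks * sizeof(unsigned int));

    record_smid<<<numBlocks, numThreads>>>(d_smid);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<unsigned int> h_smid(numBlocks);
    cudaError_t err = cudaMemcpy(h_smid.data(), d_smid,
                                 numBlocks * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Count how many blocks landed on each SM.
    std::vector<int> histogram(prop.multiProcessorCount, 0);
    for (int b = 0; b < numBlocks; ++b) {
        if (h_smid[b] < histogram.size()) ++histogram[h_smid[b]];
    }
    for (int sm = 0; sm < prop.multiProcessorCount; ++sm) {
        printf("SM %2d: %d blocks\n", sm, histogram[sm]);
    }

    cudaFree(d_smid);
    return 0;
}
```

A histogram like this makes it easy to see whether blocks are spread across all SMs or concentrated on one, without printing from inside the kernel.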

My platform is Windows 7 64-bit, driver 398.11, CUDA Toolkit 9.2.

Question: what could be going wrong? Thanks!