Kernels from multiple CUDA streams run mostly on SM 0

Hi All,

I just bought a new Asus 1080 Ti, which has 28 streaming multiprocessors (SMs).

Then I wrote a program that creates 28 OpenMP threads, each of which issues kernel launches to its own stream (explicitly created, not the default stream 0).

I observed severe serialization among the streams in the Visual Profiler.

When I print out the streams in the host code, it shows 28 different IDs (addresses), evenly distributed.

When I print out the SM ID (smid) in the kernel, it shows that most kernels are executing on SM 0: 8163 out of 8192.
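For reference, here is a minimal sketch of this kind of test. It is not the exact program: the kernel name record_smid, the helper get_smid, the <<<1, 32>>> launch shape, the per-thread launch count and the histogram size kMaxSm are all assumptions made for illustration. Each OpenMP thread creates its own stream and launches small kernels into it, and each kernel reads the PTX special register %smid and counts which SM it ran on.

```cuda
#include <cstdio>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

// Assumed upper bound for SM IDs; %smid values are not guaranteed to be
// contiguous, so the histogram is oversized and every index is guarded.
constexpr unsigned int kMaxSm = 256;

// Read the ID of the SM running the calling thread (PTX register %smid).
__device__ unsigned int get_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical kernel: thread 0 of each block bumps the counter for its SM.
__global__ void record_smid(unsigned int* hist) {
    if (threadIdx.x == 0) {
        unsigned int id = get_smid();
        if (id < kMaxSm) atomicAdd(&hist[id], 1u);
    }
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    const int num_threads = prop.multiProcessorCount;  // 28 on a 1080 Ti
    const int launches_per_thread = 8;                 // assumed value

    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_hist, kMaxSm * sizeof(unsigned int));
    cudaMemset(d_hist, 0, kMaxSm * sizeof(unsigned int));

    // One host thread per SM, each with its own explicitly created stream.
    #pragma omp parallel num_threads(num_threads)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        for (int i = 0; i < launches_per_thread; ++i) {
            // One block of 32 threads per launch (assumed launch shape).
            record_smid<<<1, 32, 0, stream>>>(d_hist);
        }
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }

    // Copy the per-SM launch counts back and print the non-empty entries.
    std::vector<unsigned int> hist(kMaxSm);
    cudaMemcpy(hist.data(), d_hist, kMaxSm * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    for (unsigned int sm = 0; sm < kMaxSm; ++sm) {
        if (hist[sm] != 0) std::printf("SM %u: %u kernels\n", sm, hist[sm]);
    }

    cudaFree(d_hist);
    return 0;
}
```

This has to be built with nvcc and the host compiler's OpenMP flag passed through -Xcompiler (for example /openmp with MSVC or -fopenmp with GCC). The per-SM counts it prints are the kind of numbers quoted above.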

My platform is Windows 7 64-bit, driver 398.11, CUDA Toolkit 9.2.

Question: what could possibly be going wrong? Thanks!