I just bought a new Asus GTX 1080 Ti, which has 28 streaming multiprocessors (SMs).
I then wrote a program that creates 28 OpenMP threads, each of which issues kernel launches to its own stream (explicitly created, not the default stream 0).
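For reference, here is a minimal sketch of the launch pattern described above. It is not my actual code: the kernel `work`, the buffer size `n`, and the launch configuration are hypothetical placeholders, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical placeholder kernel; the real kernels are not shown in this post.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int kThreads = 28;  // one host thread per SM, as described above
    const int n = 1 << 16;    // assumed number of elements per stream

    #pragma omp parallel num_threads(kThreads)
    {
        // Each OpenMP thread creates its own explicit (non-default) stream.
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemsetAsync(d, 0, n * sizeof(float), stream);

        // Launch into this thread's stream (4th launch parameter).
        work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

        // Wait only for this thread's stream, then report its handle.
        cudaStreamSynchronize(stream);
        printf("host thread %d used stream %p\n",
               omp_get_thread_num(), (void *)stream);

        cudaFree(d);
        cudaStreamDestroy(stream);
    }
    return 0;
}
```

This needs OpenMP enabled in the host compiler, for example `nvcc -Xcompiler /openmp` with MSVC or `nvcc -Xcompiler -fopenmp` with GCC.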
In the Visual Profiler I observed severe serialization among the streams.
When I print the streams in the host code, I get 28 different ids (addresses), evenly distributed.
When I print the smid inside the kernel (using the approach from https://stackoverflow.com/questions/28881491/how-can-i-find-out-which-thread-is-getting-executed-on-which-core-of-the-gpu), most kernels turn out to be executing on SM 0: 8163 out of 8192.
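A minimal sketch of how the SM id can be read in a kernel, assuming the inline-PTX approach of reading the `%smid` special register. The helper `get_smid`, the kernel `record_smid`, and the launch configuration are hypothetical names and values chosen for illustration, not my real code. Instead of printing, it builds a per-SM histogram of blocks.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid: the id of the SM running this thread.
__device__ unsigned int get_smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block records which SM the block was scheduled on.
__global__ void record_smid(unsigned int *blocks_per_sm, unsigned int num_sms)
{
    if (threadIdx.x == 0) {
        unsigned int id = get_smid();
        // Guard the index: %smid is not guaranteed to be below the
        // physical SM count reported by the runtime.
        if (id < num_sms)
            atomicAdd(&blocks_per_sm[id], 1u);
    }
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int num_sms = prop.multiProcessorCount;

    unsigned int *d_counts = nullptr;
    cudaMalloc(&d_counts, num_sms * sizeof(unsigned int));
    cudaMemset(d_counts, 0, num_sms * sizeof(unsigned int));

    // Arbitrary launch configuration, chosen only for illustration.
    record_smid<<<1024, 64>>>(d_counts, (unsigned int)num_sms);
    cudaDeviceSynchronize();

    std::vector<unsigned int> counts(num_sms);
    cudaMemcpy(counts.data(), d_counts, num_sms * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int sm = 0; sm < num_sms; ++sm)
        printf("SM %2d: %u blocks\n", sm, counts[sm]);

    cudaFree(d_counts);
    return 0;
}
```

Note that `%smid` is only a snapshot of where a thread is running at the moment it is read, so the histogram is a diagnostic aid rather than a guarantee about scheduling.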
My platform is Windows 7 64-bit, driver 398.11, CUDA toolkit 9.2.
Question: what could be going wrong? Thanks!