Hi All,
I just bought a new Asus GTX 1080 Ti, which has 28 streaming multiprocessors (SMs).
I then wrote a program that creates 28 OpenMP threads, each of which issues kernel launches to its own stream (explicitly created, not the default stream 0).
In the Visual Profiler I observed severe serialization among the streams.
When I print the stream handles in the host code, I get 28 different ids (addresses), evenly spaced.
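For reference, a minimal sketch of this kind of setup. It is not the actual program: the kernel `work`, the per-stream size `n`, and the launch configuration are placeholder assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical placeholder kernel; the real workload is not shown in the post.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int kThreads = 28;   // one host thread per SM, as described above
    const int n = 1 << 10;     // assumed number of elements per stream

    // Allocate one buffer up front; each host thread works on its own slice.
    float *d_all = nullptr;
    cudaMalloc(&d_all, (size_t)kThreads * n * sizeof(float));
    cudaMemset(d_all, 0, (size_t)kThreads * n * sizeof(float));

    #pragma omp parallel num_threads(kThreads)
    {
        int tid = omp_get_thread_num();

        cudaStream_t stream;
        cudaStreamCreate(&stream);   // explicit, non-default stream

        // Print the stream handle, as done in the host code of the post.
        #pragma omp critical
        printf("thread %d -> stream %p\n", tid, (void *)stream);

        // Launch into this thread's own stream (grid/block sizes are assumed).
        work<<<(n + 255) / 256, 256, 0, stream>>>(d_all + (size_t)tid * n, n);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }

    cudaFree(d_all);
    return 0;
}
```

With MSVC as the host compiler this would be built with something like `nvcc -Xcompiler /openmp file.cu` (or `-Xcompiler -fopenmp` with GCC).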
When I print the smid inside the kernel, using the method from the Stack Overflow question "How can I find out which thread is getting executed on which core of the GPU?", most kernels turn out to be executing on SM 0: 8163 out of 8192.
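A sketch of how the smid can be read and tallied, assuming the usual inline-PTX read of the `%smid` special register. The helper `get_smid`, the kernel `record_smid`, and the launch configuration are illustrative names and values, not the code from the program above; here a single launch with many blocks is used, and each block records the SM it ran on.

```cuda
#include <cstdio>
#include <map>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid (the SM the calling thread runs on).
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Each block stores the SM id it was scheduled on.
__global__ void record_smid(unsigned int *smids)
{
    if (threadIdx.x == 0)
        smids[blockIdx.x] = get_smid();
}

int main()
{
    const int numBlocks = 8192;   // assumed; chosen to match the count in the post
    const int blockSize = 64;     // assumed

    unsigned int *d_smids = nullptr;
    cudaMalloc(&d_smids, numBlocks * sizeof(unsigned int));

    record_smid<<<numBlocks, blockSize>>>(d_smids);

    std::vector<unsigned int> h_smids(numBlocks);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_smids.data(), d_smids, numBlocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_smids);

    // Build a histogram of blocks per SM on the host.
    std::map<unsigned int, int> hist;
    for (unsigned int id : h_smids) ++hist[id];

    for (const auto &kv : hist)
        printf("SM %u: %d blocks\n", kv.first, kv.second);
    return 0;
}
```

The histogram is built on the host because `%smid` values are not guaranteed to be contiguous, so indexing a fixed-size device array by smid would need an extra bounds check.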
My platform is Windows 7 64-bit, driver 398.11, CUDA Toolkit 9.2.
Question: what could be going wrong? Thanks!