I found a document (written in Japanese) describing how threads are started.
6.4 Thread, Block, Grid (written in Japanese)
The document says that when threads are launched, one thread starts on each clock cycle.
So if I create 512 threads, the last thread starts 512 clocks after the first.
It also says that __syncthreads() was introduced to work around this delay.
Is this true?
I believed that the 32 threads of a warp can execute in the same clock phase, up to a limit of 32 warps.
Who writes the ‘text’ in fortune cookies - the Japanese…?
What does the Japanese document say about warp schedulers, and what is the ratio between warp schedulers and warps…?
The document appears to reference G80 (compute capability 1.0). The statements above are not true.
Thank you little_jimmy and Greg!!
Now I understand the mystery.
For a beginner like me, it is very difficult to judge whether a document is outdated or obsolete.
There are some web sites that state the maximum number of threads in a block is 512.
I hope that everyone will specify their hardware, driver, SDK and OS.