I found a document (written in Japanese) describing how threads are started.
6.4 Thread, Block, Grid (written in Japanese)
The document says that when threads are started, they are launched one per clock cycle.
If I create 512 threads, the last thread would therefore start about 512 clock cycles after the first.
It also says that __syncthreads() was introduced to deal with this delay.
Is this true?
I believed that, up to the warp size of 32, all 32 threads of a warp are executed on the same clock phase.
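
One way I could check this myself is a small kernel in which every thread records the cycle counter it sees on entry. The following is only a sketch of that idea, not something taken from the document: the kernel name recordStartClock and the variable names are my own, and I am assuming a single block of 512 threads on one multiprocessor. clock64() reads a per-multiprocessor counter, so the values should only be compared within one block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores the cycle counter value it
// observes as its first action.
__global__ void recordStartClock(long long *start)
{
    start[threadIdx.x] = clock64();
}

int main()
{
    const int numThreads = 512;   // one block of 512 threads, as in the question
    long long hStart[numThreads];
    long long *dStart = nullptr;

    if (cudaMalloc(&dStart, sizeof(hStart)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    recordStartClock<<<1, numThreads>>>(dStart);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dStart);
        return 1;
    }

    cudaMemcpy(hStart, dStart, sizeof(hStart), cudaMemcpyDeviceToHost);
    cudaFree(dStart);

    // Find the earliest recorded value so the others can be shown as offsets.
    long long minClock = hStart[0];
    for (int i = 1; i < numThreads; ++i) {
        if (hStart[i] < minClock) minClock = hStart[i];
    }

    // Print the first and last thread of every warp (32 threads per warp).
    for (int w = 0; w < numThreads / 32; ++w) {
        int first = w * 32;
        int last = first + 31;
        printf("warp %2d: thread %3d at +%lld, thread %3d at +%lld cycles\n",
               w, first, hStart[first] - minClock,
               last, hStart[last] - minClock);
    }
    return 0;
}
```

If the document is right, the offsets should grow by roughly one cycle per thread. If my understanding is right, the first and last thread of each warp should report nearly the same value, with differences showing up mainly between warps.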