Perhaps you should study the whole sample code, not just the kernel, to understand how it works.
That would be true if only one thread were running. But you have multiple threads running in parallel, and in particular the threads in a warp execute in lockstep. What you are observing is the result after 32 threads, specifically the first warp, have completed the work. To make sense of that, you’ll need to understand how a GPU executes code: the debugger does not isolate a single thread for you. When you allow a thread to execute this line:
at a minimum, it will not be a single thread executing that line of code; it will be all the active threads in the warp.
If you would like to see the behavior of a single thread in isolation, one way to do that is to modify that kernel launch so that only one thread executes.
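To illustrate the difference, here is a minimal sketch that is not taken from the sample code. The kernel `bump`, the helper `run`, and the launch configurations are all hypothetical names chosen for this example, and error checking is omitted for brevity. Each thread adds 1 to a counter, so the final value shows how many threads executed the same line.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the sample): every thread that runs
// adds 1 to a single counter in global memory.
__global__ void bump(int *counter)
{
    atomicAdd(counter, 1);
}

// Hypothetical helper: launch `bump` with the given configuration and
// return the counter value after the kernel has finished.
static int run(int blocks, int threads)
{
    int *d_counter = nullptr;
    int h_counter = 0;

    cudaMalloc(&d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    bump<<<blocks, threads>>>(d_counter);

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    printf("<<<1, 32>>> counter = %d\n", run(1, 32)); // one full warp
    printf("<<<1, 1>>>  counter = %d\n", run(1, 1));  // a single thread
    return 0;
}
```

With one full warp launched, the counter should end at 32, because all 32 threads execute the `atomicAdd` line. With the `<<<1, 1>>>` launch it should end at 1. Applying the same kind of change to the launch in the sample is one way to make the debugger show single-thread behavior.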