The compiler will put variables declared inside __global__ or __device__ functions into registers unless (1) the kernel needs more registers than are available or (2) the variable is an array that you access with an index the compiler cannot resolve at compile time. Variables that don’t fit into registers for one of these two reasons are “spilled” to what the guide calls “local memory”, which is really just device memory, but organized so that every thread gets its own private piece.
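To make the second case concrete, here is a minimal sketch of a kernel whose per-thread array is indexed with a value that is only known at run time. The names dynamicIndex, lookup and N are made up for this illustration. Whether the array really ends up in local memory depends on your compiler version and target architecture, so treat this as a candidate rather than a guarantee; compiling with "nvcc -Xptxas -v" prints per-kernel register and local memory usage so you can check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 32  // hypothetical problem size: one block of 32 threads

// 'table' is private to each thread. Because it is indexed with idx[t],
// a value only known at run time, the compiler may place it in local
// memory instead of registers.
__global__ void dynamicIndex(const int *idx, float *out)
{
    float table[8];
    for (int i = 0; i < 8; ++i)
        table[i] = (float)i;

    int t = threadIdx.x;
    out[t] = table[idx[t] & 7];  // mask keeps the index inside the array
}

// Host helper (hypothetical): runs the kernel with idx[i] = i and returns
// the value written by thread t. Assumes 0 <= t < N. Error checking is
// omitted to keep the sketch short.
float lookup(int t)
{
    int   h_idx[N];
    float h_out[N];
    for (int i = 0; i < N; ++i)
        h_idx[i] = i;

    int   *d_idx = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_idx, sizeof(h_idx));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_idx, h_idx, sizeof(h_idx), cudaMemcpyHostToDevice);

    dynamicIndex<<<1, N>>>(d_idx, d_out);

    // This cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_idx);
    cudaFree(d_out);
    return h_out[t];
}

int main()
{
    printf("thread 13 read table[%d] = %f\n", 13 & 7, lookup(13));
    return 0;
}
```

If you replace idx[t] with a constant index, the compiler is free to keep the array elements in registers, and the local memory figure reported by ptxas will typically drop to zero.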
Variables declared locally in a function will normally be put into registers. You can also declare a global variable “__device__ int x” outside of the kernel. To initialize such a variable from the host, you need to use cudaMemcpyToSymbol.
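A minimal sketch of that pattern, assuming a single int and a one-thread kernel that just reads it back. The kernel readX and the helper roundTrip are hypothetical names for this example; cudaMemcpyToSymbol is the real runtime call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Global device variable, declared outside of any kernel.
__device__ int x;

// Hypothetical kernel: copies x to an output buffer so the host can see it.
__global__ void readX(int *out)
{
    *out = x;
}

// Hypothetical helper: writes 'value' into x from the host, then reads it
// back through the kernel. Returns -1 if any CUDA call fails.
int roundTrip(int value)
{
    int *d_out = nullptr;
    int result = -1;

    // Host-to-device copy into the symbol x (offset 0, default direction).
    if (cudaMemcpyToSymbol(x, &value, sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess)
        return -1;

    readX<<<1, 1>>>(d_out);

    if (cudaMemcpy(&result, d_out, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;

    cudaFree(d_out);
    return result;
}

int main()
{
    printf("x on device = %d\n", roundTrip(42));
    return 0;
}
```

Reading the variable back directly is also possible with cudaMemcpyFromSymbol, which avoids the extra kernel and buffer.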
cudaMemcpy will wait for all previous kernel calls to complete before writing memory on the GPU, so this is expected behavior. If one of your blocks detects an exit condition, it could set the device variable itself. Blocks that start later can then check the variable and return immediately, which gives you an early exit.
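Here is a rough sketch of that early-exit idea. The names stopFlag, blocksThatWorked, search and runSearch are invented for the example, and the "exit condition" is simply "this is block number target". Keep in mind that the order in which blocks are scheduled is not guaranteed, and blocks that are already running when the flag is set are not interrupted, so this only skips blocks that start afterwards.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int stopFlag;          // set to 1 by the block that wants to stop
__device__ int blocksThatWorked;  // counts blocks that did not exit early

__global__ void search(int target)
{
    // Thread 0 reads the flag once (through a volatile pointer so the read
    // is not cached in a register) and shares it, so the whole block makes
    // the same decision.
    __shared__ int stop;
    if (threadIdx.x == 0)
        stop = *((volatile int *)&stopFlag);
    __syncthreads();
    if (stop)
        return;  // a previous block asked for an early exit

    if (threadIdx.x == 0) {
        atomicAdd(&blocksThatWorked, 1);
        if (blockIdx.x == target) {   // stand-in for a real exit condition
            stopFlag = 1;
            __threadfence();          // make the write visible to other blocks
        }
    }
}

// Hypothetical helper: launches numBlocks blocks and returns how many of
// them actually did work. Returns -1 if a CUDA call fails.
int runSearch(int numBlocks, int target)
{
    int zero = 0, count = -1;
    if (cudaMemcpyToSymbol(stopFlag, &zero, sizeof(int)) != cudaSuccess ||
        cudaMemcpyToSymbol(blocksThatWorked, &zero, sizeof(int)) != cudaSuccess)
        return -1;

    search<<<numBlocks, 64>>>(target);

    // Like cudaMemcpy, this waits for the kernel to finish first.
    if (cudaMemcpyFromSymbol(&count, blocksThatWorked,
                             sizeof(int)) != cudaSuccess)
        return -1;
    return count;
}

int main()
{
    int n = runSearch(4096, 0);
    printf("%d of 4096 blocks did work before the flag was seen\n", n);
    return 0;
}
```

The count you get will vary from run to run and from GPU to GPU, since many blocks are already resident by the time the flag is written.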
I don’t have any suggestions on stopping runaway kernels from the host, except to say that in my experience on Linux, even infinite-loop kernels are killed by Ctrl-C on a machine with no display.