__syncthreads() is a device-side instruction called from within a kernel, and it means "synchronize all threads in this block" — every thread in the block waits at the barrier until all of them have reached it.
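To make the difference concrete, here is a minimal sketch of the kind of kernel that needs `__syncthreads()`. The kernel name, block size, and reduction pattern are illustrative assumptions, not something from your code:

```cuda
// Sketch: a block-wide sum reduction in shared memory.
// __syncthreads() guarantees every thread's write is visible
// before any thread reads its neighbor's slot.
__global__ void sum_block(const float *in, float *out) {
    __shared__ float buf[256];                      // assumes blockDim.x == 256
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                                // all writes to buf done

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            buf[t] += buf[t + stride];
        __syncthreads();                            // wait before next halving
    }
    if (t == 0)
        out[blockIdx.x] = buf[0];                   // one partial sum per block
}
```

Note that this barrier only coordinates threads within one block; it says nothing about other blocks, which is exactly why a host-side synchronization call exists.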
I am talking about cudaThreadSynchronize(), a function called in the host code that blocks the host until all previously issued work on the device has finished. (In newer CUDA versions this function is deprecated; cudaDeviceSynchronize() is the current name and does the same thing.)
This is needed because kernels are launched asynchronously: control is given back to the host right after the kernel is launched, not after it finishes. If you want to call many kernels in sequence, you should call cudaThreadSynchronize() to make sure the previous iteration is complete, e.g. inside a loop like:
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 1000; j++)
This isn’t needed if the kernels are independent, but in your case you need to be sure kernel1 and kernel2 finish, because they compute data that kernel3 consumes. And I presume you also want to wait for kernel3 to finish before tasking the GPU with another launch of kernel3, because otherwise your data might get mixed up.
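Putting the pattern above together, a host-side sketch might look like this. The kernel names, launch configurations, and buffer sizes are placeholders for whatever your actual code uses:

```cuda
__global__ void kernel1(float *d_a) { /* produce first intermediate result */ }
__global__ void kernel2(float *d_b) { /* produce second intermediate result */ }
__global__ void kernel3(const float *d_a, const float *d_b, float *d_out) {
    /* consume results of kernel1 and kernel2 */
}

int main(void) {
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a,   1024 * sizeof(float));
    cudaMalloc(&d_b,   1024 * sizeof(float));
    cudaMalloc(&d_out, 1024 * sizeof(float));

    for (int i = 0; i < 100; i++) {
        kernel1<<<4, 256>>>(d_a);       // launches return immediately;
        kernel2<<<4, 256>>>(d_b);       // the host does not wait here

        // Block the host until kernel1 and kernel2 have finished,
        // so kernel3 reads fully computed inputs.
        cudaDeviceSynchronize();        // cudaThreadSynchronize() in older CUDA

        kernel3<<<4, 256>>>(d_a, d_b, d_out);

        // Wait for kernel3 before the next iteration relaunches it,
        // so successive iterations don't interleave.
        cudaDeviceSynchronize();
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return 0;
}
```

This shows the host-side barrier only; how much of it you actually need depends on whether your launches go into the same stream.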