Question about the execution order of kernel launches on the GPU

I am working on a GTX 1070 Ti and recently ran into a problem. In a do-loop I have two __global__ void functions, "A" and "B". Function A collects some information, and this information is used and finally cleared in B. A and B are launched with the same number of threads, but that number may vary from one loop iteration to the next (anywhere from 0 to 30000 data items is possible).
My question is: some loop iterations have only one data item to be processed. I launch functions A and B (both with 1 thread) in sequence. Will the GPU also execute A and B in sequence, or do I have to use cudaDeviceSynchronize() to enforce the order?
I am also wondering: if I introduce a cudaStream_t and launch A and B into the same stream, is that sufficient to ensure that the GPU executes A and B in sequence?

Kernels launched into the default stream will serialize. There is no need to put a cudaDeviceSynchronize() between them.

Programming Guide :: CUDA Toolkit Documentation
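
To make the ordering concrete, here is a minimal sketch rather than the poster's actual code. The names kernelA, kernelB and runAThenB are hypothetical stand-ins for "A" and "B", and the arithmetic inside them is invented purely so the result can be checked. It assumes both kernels are issued into the same stream (stream 0 is the default stream). Work issued into one stream runs in issue order, so B sees everything A wrote without a cudaDeviceSynchronize() between the launches. The host only has to synchronize before it reads the results back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for "A": each thread records one value.
__global__ void kernelA(int *scratch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scratch[i] = i + 1;
}

// Hypothetical stand-in for "B": uses what A wrote, then clears it.
__global__ void kernelB(int *scratch, int *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        result[i] = 2 * scratch[i];
        scratch[i] = 0;
    }
}

// Launches A then B into `stream` with no synchronization in between.
// Returns the number of wrong results, or -1 on a CUDA error.
int runAThenB(int n, cudaStream_t stream) {
    if (n <= 0) return 0;  // a launch with zero blocks is an invalid configuration

    int *scratch = nullptr, *result = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&scratch, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&result, bytes) != cudaSuccess) { cudaFree(scratch); return -1; }

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    // Same stream: B does not start until A has finished.
    kernelA<<<blocks, threads, 0, stream>>>(scratch, n);
    kernelB<<<blocks, threads, 0, stream>>>(scratch, result, n);

    // The copy is issued into the same stream, so it is ordered after B.
    std::vector<int> host(n, -1);
    cudaMemcpyAsync(host.data(), result, bytes, cudaMemcpyDeviceToHost, stream);

    // The host waits here, once, before reading `host`.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err == cudaSuccess) err = cudaGetLastError();

    cudaFree(scratch);
    cudaFree(result);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2 * (i + 1)) ++mismatches;
    return mismatches;
}

int main() {
    // One data item in the default stream (the case asked about).
    printf("default stream, n=1: %d mismatches\n", runAThenB(1, 0));

    // Many data items in an explicitly created stream.
    cudaStream_t s;
    if (cudaStreamCreate(&s) != cudaSuccess) return 1;
    printf("own stream, n=30000: %d mismatches\n", runAThenB(30000, s));
    cudaStreamDestroy(s);
    return 0;
}
```

The n <= 0 guard is there because the question mentions that an iteration may have 0 data items. The ordering guarantee covers only work issued into the same stream. If A and B were launched into two different non-default streams, an explicit dependency such as an event would be needed.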

Thank you for the reply. This is really helpful.
Yesterday I added a cudaDeviceSynchronize(), and then the program ran fine and the data was unchanged.
Before doing this, I had run cuda-memcheck to check memory, and it reported no memory errors. The data were also checked and were correct. So I am wondering whether this error is due to my code, since my program is composed of several object files and A and B are in different object files.
Thanks again!