Strange memory corruption

I recently ran into a strange problem using CUDA. I launch a kernel following this pseudocode:

input  = cudaMalloc(...);
output = cudaMalloc(...);
vector<int> host(n, 0);
cudaMemcpyAsync(host.data(), output, n * sizeof(int), cudaMemcpyDeviceToHost, stream);

for (int i = 0; i < n; i++)
    cout << host[i] << endl;
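To make the flow above concrete, here is a minimal compilable sketch of what I mean (assumptions: int data, one user-created stream; kernelA here is just a placeholder for my real kernel, and the synchronize call reflects what I later tried):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for my real kernel: writes something recognizable to out.
__global__ void kernelA(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;  // stand-in for the real work
}

int main() {
    const int n = 1024;
    int *input = nullptr, *output = nullptr;
    cudaMalloc(&input,  n * sizeof(int));
    cudaMalloc(&output, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    kernelA<<<(n + 255) / 256, 256, 0, stream>>>(input, output, n);

    std::vector<int> host(n, 0);
    // Copy into the vector's storage (host.data()), not the vector object.
    cudaMemcpyAsync(host.data(), output, n * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    // Without this, the loop below can read the buffer before the
    // asynchronous copy has finished.
    cudaStreamSynchronize(stream);

    for (int i = 0; i < n; i++)
        std::printf("%d\n", host[i]);

    cudaStreamDestroy(stream);
    cudaFree(input);
    cudaFree(output);
    return 0;
}
```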


void functionBOnHost() {
    // option A:
    vector<int> cpu2output(n, 0);
    cudaMemcpyAsync(cpu2output.data(), output, n * sizeof(int), cudaMemcpyDeviceToHost, stream);

    // option B: replace kernelB with equivalent CPU code
}

The strangest part: if I run kernelA followed by the CPU replacement for kernelB (option B), I get the correct result when I print with cout. But if I run kernelA together with kernelB, or kernelA alone, the printed values are wrong and some of my results become zero.
I ran the program under cuda-gdb with memory checking enabled (`set cuda memcheck on`), and it exited normally.

I thought it might be a synchronization issue, but what confuses me is that even a single kernel launch, without any of the CPU code, gives different results. Adding cudaDeviceSynchronize() after every kernel launch does seem to help…
I still suspect synchronization, since a later part of my code appears to affect memory written by a kernel launched earlier, i.e. functionBOnHost changes the behavior and results of kernelA or its memory space…
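One more detail on how I check for errors: since kernel launches don't return an error directly, I wrap runtime calls and query the launch status explicitly. A sketch of the pattern I use (CHECK is my own helper name, not a CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CHECK is my own macro, not part of the CUDA runtime.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err));                    \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Usage after a launch:
//   kernelA<<<grid, block, 0, stream>>>(input, output, n);
//   CHECK(cudaGetLastError());             // catches launch-time errors
//   CHECK(cudaStreamSynchronize(stream));  // catches errors during execution
```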

Does anyone know what might be causing this?

Help is greatly appreciated!