Strange memory corruption

I recently ran into a strange problem using CUDA.
I launch a kernel with the following pseudo code:

input  = cudaMalloc(...);
output = cudaMalloc(...);
kernelA<<<grid, block, 0, stream>>>(input, output);  // kernelA fills output
vector<int> host(n, 0);
cudaMemcpyAsync(host.data(), output, n * sizeof(int), cudaMemcpyDeviceToHost, stream);

for (int i = 0; i < n; i++) {
    cout << host[i] << endl;
}


void functionBOnHost() {
    // option A:
    vector<int> cpu2output(n, 0);
    cudaMemcpyAsync(cpu2output.data(), output, n * sizeof(int), cudaMemcpyDeviceToHost, stream);

    // option B: replace kernelB with equivalent CPU code.
}

The strangest part: if I replace kernelA with CPU code, I get the correct solution when I print with cout. But if I keep kernelA (with or without kernelB), my printed results are wrong and some of them become zero. I ran the program under cuda-gdb with memory checking enabled (set cuda memcheck on) and it exited normally.

I thought it might be a synchronization issue, but it confuses me that even a single kernel launch without the CPU code gives me different results.
Adding cudaDeviceSynchronize(); after every kernel launch does seem to help.
I think it is related to synchronization, since a later part of my code affects the memory of a kernel launched before it, i.e. my functionBOnHost affects the behavior and results of my kernelA or its memory space.

does anyone know what might be causing this?

Help is greatly appreciated!

I think you might be misunderstanding the CUDA programming model and streams. You're launching kernelA and kernelB in the default stream, which serializes work on the GPU, but remember that kernel launches are not blocking on the CPU: they execute asynchronously with respect to the host. Let's examine the top part of your code.

You launch kernelA, and control returns immediately to your code, where you create and initialize host. Then you call cudaMemcpyAsync, which is again asynchronous with respect to the host. Control therefore returns immediately, and you try to print data that probably hasn't arrived on the host yet.
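Here is a minimal sketch of the fix (the kernelA below is a hypothetical stand-in for your kernel, since you didn't post it): synchronize the stream after the async copy, before reading the host buffer.

```cuda
#include <cstdio>
#include <vector>

// Stand-in for the poster's kernelA: writes i+1 into each output slot.
__global__ void kernelA(int* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) output[i] = i + 1;
}

int main() {
    const int n = 8;
    int* output = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&output, n * sizeof(int));

    kernelA<<<(n + 255) / 256, 256, 0, stream>>>(output, n);

    std::vector<int> host(n, 0);
    cudaMemcpyAsync(host.data(), output, n * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);

    // The crucial step: block the CPU until the kernel and the copy
    // enqueued in this stream have actually finished.
    cudaStreamSynchronize(stream);

    for (int i = 0; i < n; i++)
        printf("%d\n", host[i]);

    cudaFree(output);
    cudaStreamDestroy(stream);
    return 0;
}
```

cudaDeviceSynchronize() after the copy would also work, but cudaStreamSynchronize(stream) waits only for the work in that one stream, which is usually what you want.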

Please read the Asynchronous Concurrent Execution section of the CUDA C++ Programming Guide for more information.