cudaMemcpyAsync Problem

Hi, today I’m facing a weird error:

I have several streams and several buffers.

I’m doing something like this:

d_bufx_x is allocated with cudaMalloc, h_bufx_x with cudaHostAlloc

kernel<<<stream1>>>(d_buf1_1, d_buf1_2);

cudaMemcpyAsync(h_buf1_1, d_buf1_1, stream1);

cudaMemcpyAsync(h_buf1_2, d_buf1_2, stream1);

kernel<<<stream2>>>(d_buf2_1, d_buf2_2);

cudaMemcpyAsync(h_buf2_1, d_buf2_1, stream2);

cudaMemcpyAsync(h_buf2_2, d_buf2_2, stream2);

...

cudaDeviceSynchronize();

Now I would expect all data to be computed and copied, but for some reason the contents of h_buf2_2 are corrupted

(it consists of only 6 int values, and elements [1] to [3] contain data from somewhere else).

I get neither cudaErrors nor any exception, everything seems to be fine…

For consideration: it works without problems with cudaMemcpy instead of cudaMemcpyAsync.

What am I doing wrong?

Add count and kind arguments to the function call, so that the stream1 and stream2 arguments don’t get misinterpreted as the length of the memory block to copy. cudaMemcpyAsync() takes 5 arguments if a stream is specified.
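
For illustration, here is a minimal sketch of the pattern from the first post with all five arguments spelled out (dst, src, count, kind, stream). The kernel body, the CHECK macro, the array-style buffer names and the launch configuration are assumptions made up for this example, not the original poster's code. Kernel launches do not return an error code themselves, so the sketch queries cudaGetLastError() after each launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Placeholder kernel (assumption): fills both buffers from the thread index.
__global__ void kernel(int *a, int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = i;
        b[i] = 10 * i;
    }
}

int main()
{
    const int n = 6;                          // 6 ints, as in the question
    const size_t bytes = n * sizeof(int);
    const int numStreams = 2;

    cudaStream_t stream[numStreams];
    int *d_bufA[numStreams], *d_bufB[numStreams];   // device buffers
    int *h_bufA[numStreams], *h_bufB[numStreams];   // pinned host buffers

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaMalloc((void **)&d_bufA[s], bytes));
        CHECK(cudaMalloc((void **)&d_bufB[s], bytes));
        CHECK(cudaHostAlloc((void **)&h_bufA[s], bytes, cudaHostAllocDefault));
        CHECK(cudaHostAlloc((void **)&h_bufB[s], bytes, cudaHostAllocDefault));
    }

    for (int s = 0; s < numStreams; ++s) {
        // Launch configuration: grid, block, shared memory bytes, stream.
        kernel<<<1, 32, 0, stream[s]>>>(d_bufA[s], d_bufB[s], n);
        CHECK(cudaGetLastError());            // catches launch errors

        // dst, src, count in bytes, kind, stream
        CHECK(cudaMemcpyAsync(h_bufA[s], d_bufA[s], bytes,
                              cudaMemcpyDeviceToHost, stream[s]));
        CHECK(cudaMemcpyAsync(h_bufB[s], d_bufB[s], bytes,
                              cudaMemcpyDeviceToHost, stream[s]));
    }

    // Host buffers must not be read before the copies have completed.
    CHECK(cudaDeviceSynchronize());

    for (int s = 0; s < numStreams; ++s) {
        printf("stream %d, second buffer:", s);
        for (int i = 0; i < n; ++i) printf(" %d", h_bufB[s][i]);
        printf("\n");
    }

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaFreeHost(h_bufA[s]));
        CHECK(cudaFreeHost(h_bufB[s]));
        CHECK(cudaFree(d_bufA[s]));
        CHECK(cudaFree(d_bufB[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    return 0;
}
```

Note that count is a size in bytes, not a number of elements, and that work within one stream runs in issue order, so each copy starts only after the kernel in the same stream has finished.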

I skipped those in the example code in the first post, for simplification. My real code contains these values ofc ;)

Can you post some real code to enable us to find real mistakes?