cudaMemcpyAsync Problem

Hi, today I’m facing a weird error:

I have several streams and several buffers.

I’m doing something like this:

d_bufx_x is allocated with cudaMalloc, h_bufx_x with cudaHostAlloc

kernel<<<stream1>>>(d_buf1_1, d_buf1_2);

cudaMemcpyAsync(h_buf1_1, d_buf1_1, stream1);

cudaMemcpyAsync(h_buf1_2, d_buf1_2, stream1);

kernel<<<stream2>>>(d_buf2_1, d_buf2_2);

cudaMemcpyAsync(h_buf2_1, d_buf2_1, stream2);

cudaMemcpyAsync(h_buf2_2, d_buf2_2, stream2);

...

cudaDeviceSynchronize();

Now I would expect all data to be computed and copied, but for some reason h_buf2_2 ends up partially corrupted

(it consists of only 6 int values, and values [1] to [3] contain data from somewhere else in memory).

I get neither cudaErrors nor any exception, everything seems to be fine…

Note: it works without problems when I use cudaMemcpy instead of cudaMemcpyAsync.

What am I doing wrong?

Add count and kind arguments to the function call, so that the stream1 and stream2 arguments don’t get misinterpreted as the length of the memory block to copy. cudaMemcpyAsync() takes 5 arguments if a stream is specified.
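
For reference, here is a minimal sketch of what the full five-argument form could look like for the two-stream pattern from the first post. It is only an illustration, not the original poster's code: the kernel body, the CHECK macro, the run_two_streams() function and the tag values are made up for this example, and the launch configuration of one block with 32 threads is an assumption. Note that count is given in bytes, and that a stream goes into the fourth slot of the kernel launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): report a CUDA
// runtime error and return early. Resources are not released on this path.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            return 1;                                                   \
        }                                                               \
    } while (0)

constexpr int N = 6;  // six ints per buffer, as in the question

// Placeholder kernel (assumption): fills both buffers with known values.
__global__ void kernel(int *a, int *b, int tag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        a[i] = tag + i;
        b[i] = tag + 100 + i;
    }
}

// Runs kernel + two async copies on each of two streams.
// Returns 0 if every host buffer holds the expected values.
int run_two_streams()
{
    const size_t bytes = N * sizeof(int);
    cudaStream_t stream[2];
    int *d_buf[2][2];
    int *h_buf[2][2];

    for (int s = 0; s < 2; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        for (int k = 0; k < 2; ++k) {
            CHECK(cudaMalloc((void **)&d_buf[s][k], bytes));
            // Pinned host memory, as in the question (cudaHostAlloc).
            CHECK(cudaHostAlloc((void **)&h_buf[s][k], bytes,
                                cudaHostAllocDefault));
        }
    }

    for (int s = 0; s < 2; ++s) {
        // Launch configuration: grid, block, shared memory bytes, stream.
        kernel<<<1, 32, 0, stream[s]>>>(d_buf[s][0], d_buf[s][1],
                                        1000 * (s + 1));
        CHECK(cudaGetLastError());  // catches launch errors
        for (int k = 0; k < 2; ++k) {
            // Arguments: dst, src, count (in bytes), kind, stream.
            CHECK(cudaMemcpyAsync(h_buf[s][k], d_buf[s][k], bytes,
                                  cudaMemcpyDeviceToHost, stream[s]));
        }
    }

    // Wait for all streams before reading the host buffers.
    CHECK(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int s = 0; s < 2; ++s)
        for (int k = 0; k < 2; ++k)
            for (int i = 0; i < N; ++i)
                if (h_buf[s][k][i] != 1000 * (s + 1) + 100 * k + i)
                    ++mismatches;

    for (int s = 0; s < 2; ++s) {
        for (int k = 0; k < 2; ++k) {
            CHECK(cudaFreeHost(h_buf[s][k]));
            CHECK(cudaFree(d_buf[s][k]));
        }
        CHECK(cudaStreamDestroy(stream[s]));
    }
    return mismatches == 0 ? 0 : 1;
}

int main()
{
    int rc = run_two_streams();
    std::printf(rc == 0 ? "all buffers match\n" : "mismatch or CUDA error\n");
    return rc;
}
```

If a sketch like this runs cleanly on your machine but your real code does not, comparing the two call by call (especially the count values and which stream each call uses) may help narrow things down.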

I skipped those in the example code in the first post for simplicity. My real code contains these values, of course.

Can you post some real code to enable us to find real mistakes?