Another issue with cudaStreamAttachMemAsync on TX2

Hi supporters,

While following up on my earlier topic (still waiting for an answer from AastaLLL), I found another issue with cudaStreamAttachMemAsync on TX2.

This is my original source code:

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));

remap<<<grid, block, 0, stream>>>(gpuInput, gpuOutput, mapX, mapY, width, height);

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));

I tried replacing my kernel with cudaMemcpyAsync, which gives:

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));

cudaSafeCall(cudaMemcpyAsync(gpuInput, gpuOutput, outputMemsize, cudaMemcpyDeviceToHost, stream));

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));

I then tested this on both my Linux PC and the TX2. On the TX2, it always returns the first image’s data, regardless of the new image data in gpuInput. This issue doesn’t happen on my Linux PC.
I also came across this trick:

An additional cudaStreamSynchronize(NULL) call can be added after the matrixMul kernel code to avoid callback threads that cause unpredictability in a cudaStreamAttachMemAsync() call.

Following it, I added a cudaStreamSynchronize(stream) after my kernel (remap) in the first sample and after cudaMemcpyAsync in the second, but neither change helped: the first still returned dead pixels, and the second still returned only the first image’s data.

Is there any configuration that can solve this problem on TX2?
Since cudaStreamAttachMemAsync is meant to speed up data flow with unified memory (UM), it should work without any problem. I also looked for a configuration that would make cudaStreamAttachMemAsync work as expected, but I couldn’t find one.

Please help me!


Thanks for the experiment.
It looks like this issue is related to the launch time of the CPU/GPU.

Our internal team is still checking this issue.
We will also pass this information to them.



Thanks for your support. I hope to get an update soon.


Hi AastaLLL,

Thanks for pointing out my mistake. gpuInput and gpuOutput should be swapped in the cudaMemcpyAsync call, since the destination pointer comes first.
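
For anyone reading this later, here is a minimal sketch of the copy variant with the arguments in the right order: cudaMemcpyAsync takes the destination first, so gpuOutput comes before gpuInput. Several things in it are assumptions for illustration and not from my real code. The CHECK macro is a hypothetical stand-in for cudaSafeCall, copyFramesThroughStream is a made-up helper name, and the buffer size, frame count, and cudaMemcpyDefault copy kind (instead of cudaMemcpyDeviceToHost) are my own choices for two managed buffers. It has not been run on a TX2, so treat it as a sketch only.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the cudaSafeCall macro used in the posts above.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Pushes `frames` different inputs through one stream and returns true
// if every output matches the input that was written for that frame.
bool copyFramesThroughStream(int frames, size_t n) {
    assert(frames > 0 && n > 0);
    unsigned char *gpuInput = nullptr, *gpuOutput = nullptr;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMallocManaged(&gpuInput, n));
    CHECK(cudaMallocManaged(&gpuOutput, n));

    bool ok = true;
    for (int f = 0; f < frames; ++f) {
        // Give the input buffer to the CPU before writing a new frame.
        CHECK(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));
        for (size_t i = 0; i < n; ++i)
            gpuInput[i] = static_cast<unsigned char>(f + i);

        // Make both buffers visible to the device again.
        CHECK(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
        CHECK(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachGlobal));

        // Destination first, source second.
        CHECK(cudaMemcpyAsync(gpuOutput, gpuInput, n, cudaMemcpyDefault, stream));

        // Hand the output to the CPU and wait before reading it.
        CHECK(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));
        for (size_t i = 0; i < n; ++i)
            if (gpuOutput[i] != static_cast<unsigned char>(f + i)) ok = false;
    }

    CHECK(cudaFree(gpuInput));
    CHECK(cudaFree(gpuOutput));
    CHECK(cudaStreamDestroy(stream));
    return ok;
}

int main() {
    bool ok = copyFramesThroughStream(3, 1024);
    printf("%s\n", ok ? "all frames matched" : "mismatch detected");
    return ok ? 0 : 1;
}
```

With the swapped arguments from my earlier snippet, the copy would run from gpuOutput into gpuInput, so gpuOutput would never pick up new frames, which would match the "always the first image" symptom I described.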