Another issue with cudaStreamAttachMemAsync on TX2

tpham · June 20, 2019, 6:38am

Hi supporters,

Following up on https://devtalk.nvidia.com/default/topic/1051799/jetson-tx2/cudastreamattachmemasync-race-condition-in-tx2/ (still waiting for an answer from AastaLLL), I found another issue with cudaStreamAttachMemAsync on TX2.

This is my original source code: simple_remap.zip

I tried replacing my kernel with cudaMemcpyAsync, so that:

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
remap<<<grid, block, 0, stream>>>(gpuInput, gpuOutput, mapX, mapY, width, height);
cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));

becomes

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
cudaSafeCall(cudaMemcpyAsync(gpuInput, gpuOutput, outputMemsize, cudaMemcpyDeviceToHost, stream));
cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));

I then tested this on both my Linux PC and the TX2. On the TX2, the output always contains the first image's data, no matter what new image data is written to gpuInput. This issue doesn’t happen on my Linux PC.

I also tried the tip mentioned in CUDA for Tegra :: CUDA Toolkit Documentation:
"Note:

An additional cudaStreamSynchronize(NULL) call can be added after the matrixMul kernel code to avoid callback threads that cause unpredictability in a cudaStreamAttachMemAsync() call.
"
I added a cudaStreamSynchronize(stream) call after my kernel (remap) in the first sample and after cudaMemcpyAsync in the second, but both still showed the same issues: the first returned dead pixels, while the second returned only the first image's data.

Is there any configuration that can solve this problem on TX2?
Since cudaStreamAttachMemAsync is meant to speed up the data flow of unified memory (UM), it should work without any problem. I also looked for a configuration that would make cudaStreamAttachMemAsync work as expected, but I couldn’t find one.

Please help me!

AastaLLL · June 20, 2019, 8:42am

Hi,

Thanks for the experiment.
It looks like this issue is related to the launch time of the CPU/GPU.

Our internal team is still checking this issue.
We will also pass this information to them.

Thanks.

tpham · June 20, 2019, 9:09am

Hi,

Thanks for your support. I hope to get an update soon.

Thinh

tpham · August 7, 2019, 9:14am

Hi AastaLLL,

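Thanks for pointing out my mistake. gpuInput and gpuOutput should be swapped in the cudaMemcpyAsync call: cudaMemcpyAsync takes the destination pointer first, so the copy has to go from gpuInput into gpuOutput.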
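
For anyone landing here later, this is a minimal sketch of the corrected copy path, not the code from simple_remap.zip. Several things in it are my own assumptions: the CHECK macro stands in for cudaSafeCall, runCopyTest and its synthetic frames are hypothetical, cudaMemcpyDefault replaces cudaMemcpyDeviceToHost because both buffers are managed, and gpuOutput is re-attached with cudaMemAttachGlobal before each copy because the previous iteration handed it to the host.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical error-check macro standing in for the thread's cudaSafeCall.
// Note: on failure it returns early without freeing (acceptable for a sketch).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            return false;                                            \
        }                                                            \
    } while (0)

// Hypothetical test driver: pushes `frames` synthetic frames through the
// attach / copy / attach sequence and checks that every output is fresh.
bool runCopyTest(size_t bytes, int frames)
{
    unsigned char *gpuInput = nullptr, *gpuOutput = nullptr;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMallocManaged(&gpuInput, bytes));
    CHECK(cudaMallocManaged(&gpuOutput, bytes));

    bool ok = true;
    for (int i = 0; i < frames && ok; ++i) {
        // The stream was synchronized at the end of the previous iteration,
        // so the GPU is idle and the host may write the new frame.
        std::memset(gpuInput, i + 1, bytes);

        CHECK(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
        // Assumption: give the output back to the device before writing it.
        CHECK(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachGlobal));

        // Destination first, source second: gpuOutput <- gpuInput.
        CHECK(cudaMemcpyAsync(gpuOutput, gpuInput, bytes, cudaMemcpyDefault, stream));

        CHECK(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));

        // The output should now hold this iteration's frame, not the first one.
        const unsigned char expected = static_cast<unsigned char>(i + 1);
        ok = gpuOutput[0] == expected && gpuOutput[bytes - 1] == expected;
    }

    CHECK(cudaFree(gpuInput));
    CHECK(cudaFree(gpuOutput));
    CHECK(cudaStreamDestroy(stream));
    return ok;
}
```

Calling runCopyTest(1 << 16, 3) from main() should return true if each frame reaches gpuOutput. With the arguments in the original order (gpuInput as destination), the copy overwrites the input instead and gpuOutput never receives the new frame.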
