By following https://devtalk.nvidia.com/default/topic/1051799/jetson-tx2/cudastreamattachmemasync-race-condition-in-tx2/ (still awaiting a reply from AastaLLL there), I found another issue with cudaStreamAttachMemAsync on the TX2.
This is my original source code: https://workupload.com/file/rFYtbM3v
I tried replacing my kernel with cudaMemcpyAsync. So this original version:

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
remap<<<grid, block, 0, stream>>>(gpuInput, gpuOutput, mapX, mapY, width, height);
cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));

became:

cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
cudaSafeCall(cudaMemcpyAsync(gpuInput, gpuOutput, outputMemsize, cudaMemcpyDeviceToHost, stream));
cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));
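For reference, my understanding of the full per-frame attach/compute/detach cycle from the Tegra appnote looks like the sketch below. This is an assumption on my part, not my exact code: fillInputOnCpu and consumeOutputOnCpu are hypothetical helpers, and I assume gpuInput/gpuOutput were allocated with cudaMallocManaged.

```cuda
// Sketch of the attach/compute/detach cycle as I read the appnote.
// Assumes gpuInput/gpuOutput come from cudaMallocManaged and the stream
// from cudaStreamCreate; error checking omitted for brevity.

// 1. CPU fills gpuInput while it is attached to the host (no GPU work yet).
fillInputOnCpu(gpuInput, inputMemsize);              // hypothetical helper

// 2. Hand both buffers over to the GPU for the duration of the kernel.
cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal);
cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachGlobal);

// 3. Launch the work on the same stream.
remap<<<grid, block, 0, stream>>>(gpuInput, gpuOutput, mapX, mapY, width, height);

// 4. Return the output buffer to the host side.
cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost);

// 5. The attach operations are asynchronous: the CPU must not touch the
//    buffers until the stream has drained.
cudaStreamSynchronize(stream);

// 6. Only now is it safe to read gpuOutput from the CPU.
consumeOutputOnCpu(gpuOutput, outputMemsize);        // hypothetical helper
```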
I then tested on both a Linux PC and the TX2. On the TX2, it always returns the first image's data, regardless of the new image data in gpuInput. This issue does not occur on my Linux PC.
I also tried a workaround mentioned in https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#effective-usage-unified-memory:
An additional cudaStreamSynchronize(NULL) call can be added after the matrixMul kernel code to avoid callback threads that cause unpredictability in a cudaStreamAttachMemAsync() call.
Following that, I added a cudaStreamSynchronize(stream) call after my kernel (remap) in the first sample and after the cudaMemcpyAsync in the second, but both still showed the same issue (the first returned dead pixels, while the second returned only the first image's data).
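Concretely, in the kernel variant the synchronize call went directly after the launch, like this (the memcpy variant was identical, with the launch replaced by the cudaMemcpyAsync call):

```cuda
cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuInput, 0, cudaMemAttachGlobal));
remap<<<grid, block, 0, stream>>>(gpuInput, gpuOutput, mapX, mapY, width, height);
cudaSafeCall(cudaStreamSynchronize(stream));  // added per the appnote's suggestion
cudaSafeCall(cudaStreamAttachMemAsync(stream, gpuOutput, 0, cudaMemAttachHost));
```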
Is there any configuration that can solve this problem on the TX2?
Since cudaStreamAttachMemAsync is meant to speed up the data flow of Unified Memory, it should work without problems. I also looked for a configuration that would make cudaStreamAttachMemAsync behave as expected, but I couldn't find one.
Please help me!