cudaStreamAttachMemAsync race condition in TX2

Hi all,

I have a simple code for image copy using CUDA.

__global__ void remap(uchar3 *a, uchar3 *b, int w, int h) {
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

	// Guard: divUp rounds the grid up, so edge threads can fall outside the image
	if (x >= w || y >= h)
		return;

	b[y * w + x] = a[y * w + x];
}

int main() {
	uchar3 *a = <load bitmap data>;
	uchar3 *b;
	cudaMallocManaged((void **)&b, <bytes of bitmap data>);
	int w = 1600, h = 1200;

	dim3 block = dim3(BLOCK_SIZE_W, BLOCK_SIZE_H);
	dim3 grid = dim3(divUp(w, block.x), divUp(h, block.y));

	cudaStream_t stream;
	cudaStreamCreate(&stream);

	cudaStreamAttachMemAsync(stream, a, 0, cudaMemAttachGlobal);
		
	remap<<<grid, block, 0, stream>>>(a, b, w, h);
	
	cudaStreamAttachMemAsync(stream, b, 0, cudaMemAttachHost);

	cudaDeviceSynchronize();
}

Following https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#effective-usage-unified-memory, I use cudaStreamAttachMemAsync to improve the performance of my unified-memory implementation. On my PC it runs well and returns correct output. On the TX2, however, the resulting images contain trash pixels. Is this a bug of CUDA on the TX2?

Here are my samples
processed on my PC:

processed on TX2:

Please help!

Hi,

Are you also using the GPU in the PC version?

The trash pixels appear in a certain regular pattern.
It’s recommended to check the thread/block indexing first.

Thanks.

Hi,

Yes, I’m also using the GPU in the PC version (same code on both platforms).
If I remove cudaStreamAttachMemAsync, no trash pixels appear on the TX2. I have no idea what I should check for; is there any specific article or document that would help me understand this problem?

Thank you!

Hi,

You can find some information in our CUDA document:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM

This issue looks strange to me. Could you check the return status of both cudaStreamAttachMemAsync calls?
I will check whether there is any limitation on the Jetson platform.
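For reference, the return status can be captured with a small wrapper along these lines (a minimal sketch; CHECK_CUDA is a hypothetical helper macro, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print file/line and the error string for any
// CUDA runtime call that does not return cudaSuccess.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: CUDA error: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_)); \
        }                                                          \
    } while (0)

// Usage (with the variables from the original snippet):
// CHECK_CUDA(cudaStreamAttachMemAsync(stream, b, 0, cudaMemAttachHost));
```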

Thanks.

Hi AastaLLL,

I’ve tried again; cudaStreamAttachMemAsync returned cudaSuccess on both platforms. The trash pixels appear only when I call cudaStreamAttachMemAsync(stream, b, 0, cudaMemAttachHost); to fetch my output; it doesn’t happen with cudaStreamAttachMemAsync(stream, a, 0, cudaMemAttachGlobal);.

I suspect the UM pointer was attached to the host while some kernel threads were still active, so those threads couldn’t write their results to memory (the region is limited to host access only after the pointer is attached). I’m going to study the document you linked; if you have another suggestion, please let me know.
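For what it’s worth, my current understanding of the intended ordering is sketched below (based on my reading of the stream API docs, not a confirmed fix; variable names follow the original snippet and both a and b are assumed to be cudaMallocManaged allocations):

```cuda
// Attach both buffers to the stream before launching, so the GPU owns them
// while the kernel runs (cudaMemAttachSingle limits access to this stream).
cudaStreamAttachMemAsync(stream, a, 0, cudaMemAttachSingle);
cudaStreamAttachMemAsync(stream, b, 0, cudaMemAttachSingle);
cudaStreamSynchronize(stream);   // let the attachment take effect

remap<<<grid, block, 0, stream>>>(a, b, w, h);

// Re-attach the output to the host. This is queued on the stream, so it is
// ordered after the kernel, but the host must still wait before reading b.
cudaStreamAttachMemAsync(stream, b, 0, cudaMemAttachHost);
cudaStreamSynchronize(stream);   // kernel + attach complete; b is safe to read
```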

Thanks.

Hi,

Would you mind sharing a complete source to reproduce this issue?
We want to check it with our internal team.

Thanks.

Hi,

Here is my sample code: http://s000.tinyupload.com/?file_id=00535759807175788814. This source uses OpenCV 3.4.4 for image load/save. To build the executable, please cd to simple_remap/Debug, run make, launch the app with “./simple_remap”, then check the results in simple_remap/Debug/images.

Thanks.

Hi,

Thanks for your sample.
We will check this and update you with more information later.

Hi,

Hope to get your update soon.

Hi,

When reproducing this issue, we keep hitting several library link errors.
It looks like your makefile was generated by cmake (please correct me if I’m wrong).

Would you mind sharing the CMakeLists.txt with us so we can generate a makefile for our environment?

Thanks.

Hi,

My Makefile was generated by Nsight Eclipse Edition. Please share your build log and I will try to help.
By the way, my OpenCV was built with these cmake flags:

cmake -D CMAKE_BUILD_TYPE=RELEASE \
	-D CMAKE_INSTALL_PREFIX=/usr/local \
	-D OPENCV_ENABLE_NONFREE=ON \
	-D WITH_CUDA=ON \
	-D CMAKE_LIBRARY_PATH=/usr/local/cuda-10.0/lib64/stubs \
	-D CUDA_FAST_MATH=1 \
	-D WITH_CUBLAS=1 \
	-D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
	-D WITH_OPENGL=ON \
	-D WITH_NVCUVID=ON \
	-D WITH_LIBV4L=ON \
	-D WITH_FFMPEG=ON \
	-D WITH_GSTREAMER=ON \
	-D BUILD_opencv_cudacodec=OFF \
	-D BUILD_NEW_PYTHON_SUPPORT=ON \
	-D BUILD_opencv_python2=ON \
	-D HAVE_opencv_python2=ON \
	-D PYTHON2_EXECUTABLE=/usr/bin/python2.7 \
	-D PYTHON_INCLUDE_DIR=/usr/include/python2.7 \
	-D PYTHON_INCLUDE_DIR2=/usr/include/x86_64-linux-gnu/python2.7 \
	-D PYTHON_LIBRARY=/usr/lib/x86_64-linux-gnu/libpython2.7.so \
	-D BUILD_EXAMPLES=OFF ..

Only OpenCV is linked, so your compilation issues probably come from missing OpenCV components.

I’m here whenever you need me.

Best regards,
Thinh.

Hi,

We found several x86_64-linux-gnu links inside the makefile.
Since the sample is meant to run on the TX2, is there any reason to link the x86 libraries?

Or could you provide a version with aarch64 links for us?
Thanks.

Hi,

Sorry about this mistake. My sample code was extracted from a private project which included some configurations irrelevant to compiling on the TX2. I’ve just cleaned it up and tried compiling on both platforms. Please check again!

https://workupload.com/file/rFYtbM3v

I also attached more images for testing; the trash-pixel issue only occurs after the second image is processed.

Thinh

Hi,

Thanks for the update.
We will try to reproduce this and update you with more information.

Hi,

We tested the new sample today but still hit the same OpenCV error:

Building file: ../Recorders/Recorder.cpp
Invoking: NVCC Compiler
/usr/local/cuda-10.0/bin/nvcc -DDEBUG=1 -I/usr/local/include -G -g -O0 -std=c++11 -gencode arch=compute_61,code=sm_61  -odir "Recorders" -M -o "Recorders/Recorder.d" "../Recorders/Recorder.cpp"
/usr/local/cuda-10.0/bin/nvcc -DDEBUG=1 -I/usr/local/include -G -g -O0 -std=c++11 --compile  -x c++ -o  "Recorders/Recorder.o" "../Recorders/Recorder.cpp"
Finished building: ../Recorders/Recorder.cpp
 
Building file: ../Recorders/SequenceImagesRecorder.cpp
Invoking: NVCC Compiler
/usr/local/cuda-10.0/bin/nvcc -DDEBUG=1 -I/usr/local/include -G -g -O0 -std=c++11 -gencode arch=compute_61,code=sm_61  -odir "Recorders" -M -o "Recorders/SequenceImagesRecorder.d" "../Recorders/SequenceImagesRecorder.cpp"
/usr/local/cuda-10.0/bin/nvcc -DDEBUG=1 -I/usr/local/include -G -g -O0 -std=c++11 --compile  -x c++ -o  "Recorders/SequenceImagesRecorder.o" "../Recorders/SequenceImagesRecorder.cpp"
../Recorders/SequenceImagesRecorder.cpp: In member function ‘virtual void SequenceImagesRecorder::record()’:
../Recorders/SequenceImagesRecorder.cpp:22:53: error: ‘cv::String::String(int)’ is private within this context
  cv::VideoCapture cap(this->m_cameraId, cv::CAP_V4L2);
                                                     ^
In file included from /usr/local/include/opencv2/core/base.hpp:58:0,
                 from /usr/local/include/opencv2/core.hpp:54,
                 from /usr/local/include/opencv2/opencv.hpp:52,
                 from ../Recorders/SequenceImagesRecorder.cpp:3:
/usr/local/include/opencv2/core/cvstd.hpp:556:5: note: declared private here
     String(int); // disabled and invalid. Catch invalid usages like, commandLineParser.has(0) problem
     ^~~~~~
Recorders/subdir.mk:21: recipe for target 'Recorders/SequenceImagesRecorder.o' failed
make: *** [Recorders/SequenceImagesRecorder.o] Error 1

It looks like there is some difference between our OpenCV builds.
Here is the script we use for building OpenCV 3.4:
https://github.com/AastaNV/JEP/blob/master/script/install_opencv3.4.0_TX2.sh

Could you check if there is anything missing?
Thanks.

I was using OpenCV 3.4.4. If you use OpenCV 3.4.0, please just remove the apiPreference argument, so that:

cv::VideoCapture cap(this->m_cameraId, cv::CAP_V4L2);

would become

cv::VideoCapture cap(this->m_cameraId);

In OpenCV 3.4.0, cv::VideoCapture has no (index, apiPreference) constructor, so cv::CAP_V4L2 can’t be passed explicitly there; the extra int argument gets matched against the disabled cv::String(int) constructor, which produces the error above.
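If the same source needs to build against both versions, a preprocessor guard along these lines could work (a sketch; the 3.4.4 cutoff is taken from this thread’s observation, and the CV_VERSION_* macros come from opencv2/core/version.hpp):

```cpp
#include <opencv2/opencv.hpp>

// Pick the VideoCapture constructor based on the OpenCV version:
// the (index, apiPreference) overload works here with 3.4.4 but not 3.4.0.
#if CV_VERSION_MAJOR > 3 || \
    (CV_VERSION_MAJOR == 3 && CV_VERSION_MINOR == 4 && CV_VERSION_REVISION >= 4)
    cv::VideoCapture cap(this->m_cameraId, cv::CAP_V4L2);  // explicit V4L2 backend
#else
    cv::VideoCapture cap(this->m_cameraId);                // index-only constructor
#endif
```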

Thanks.

Hi AastaLLL,

Is there any news?

Hi,

We are trying to reproduce this issue in our environment.
It may take some time due to our limited bandwidth.

We will update you with more information once we have a further update.
Thanks.

Hi,

Thanks for updating me.

Hi,

We can reproduce this issue and have redirected it to our internal team.
Thanks.