cudaStreamSynchronize not working properly

Hi.
I want to generate histograms for a total of 77 pictures and am working on a Jetson Xavier. I therefore created one stream per picture and allocated a histogram buffer for each in unified memory. Afterwards I attach each histogram buffer to its corresponding stream, as you can see here:

for(int bufNo = 0; bufNo<MAXFRAME; bufNo++)
{
   gpuErrchk(cudaStreamCreate(&mCudaStreams[bufNo]));
   gpuErrchk(cudaMallocManaged((void**)&mHorHisBuf[bufNo], sizeof(unsigned int) * mXsize* mYsize));
   gpuErrchk(cudaStreamAttachMemAsync(mCudaStreams[bufNo], mHorHisBuf[bufNo], sizeof(unsigned int)* mXsize* mYsize, cudaMemAttachSingle));
}

The following snippet shows how the kernels that fill the histograms are launched. As soon as the first histogram buffer is filled I want to continue on the CPU to find the maximum, which is why cudaStreamSynchronize is called.

for(int ImgNum=0; ImgNum<MAXFRAME; ImgNum++)
{
    // Clear buffer
    cudaMemsetAsync(mHorHisBuf[ImgNum], 0, sizeof(unsigned int)*mHorHisBufSx*mHorHisBufSy, mCudaStreams[ImgNum]);

    // Launch kernel
    kernel_fillHorizontHistogram<<<grid, block, 0, mCudaStreams[ImgNum]>>>(mHorHisBuf[ImgNum], ...);
}


for(int idx=0; idx<mNumImg; idx++)
{
    gpuErrchk(cudaStreamSynchronize(mCudaStreams[idx]));
    findMax(mHorHisBuf[idx]);
    ....
}

Profiling the program with nvvp does not show the expected result, as can be seen in the figure: cudaStreamSynchronize waits much longer than the execution time of the first stream.

Changing cudaMemAttachSingle to cudaMemAttachHost in cudaStreamAttachMemAsync gave me the expected result. The computed output is also the same as with the first version. Nevertheless, cuda-memcheck reports a lot of errors because of invalid writes inside the kernel.

I have the following questions:

  • Can someone explain the behaviour of cudaStreamSynchronize in the first version?
  • Why is the result in the second version the same as in the first, although cuda-memcheck reports invalid writes?
  • How can I get the timing of the first version without errors from cuda-memcheck?

Thank you for your help.

There is not much I can do without complete code. However, the problems with cudaMemAttachHost don’t surprise me. The docs say:

Memory cannot be accessed by any stream on any device

I also don’t have a Jetson Xavier to work on. You might get better responses on the Jetson Xavier forum.
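
Since a complete code was asked for, here is a minimal, self-contained sketch of the pattern described in the question: one stream per buffer, managed memory attached with cudaMemAttachSingle, then a per-stream synchronize followed by CPU access. It is only an illustration under assumptions. The kernel fillHistogram, the synthetic "pixel" data, the CHECK macro, and the sizes kNumStreams, kBins and kPixels are hypothetical stand-ins, not the original poster's code, and it has not been run on a Jetson Xavier, so it may or may not reproduce the timing seen in the profiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check helper, standing in for gpuErrchk from the question.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Assumed sizes for illustration only (the question uses 77 pictures).
constexpr int kNumStreams = 4;
constexpr int kBins       = 256;
constexpr int kPixels     = 1 << 16;

// Stand-in kernel: every thread treats its index as a synthetic pixel value
// and increments the matching bin.
__global__ void fillHistogram(unsigned int* hist, int numPixels, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        atomicAdd(&hist[i % numBins], 1u);
    }
}

// Returns the number of histograms whose bin counts do not sum to kPixels.
int runDemo()
{
    cudaStream_t  streams[kNumStreams];
    unsigned int* hist[kNumStreams];
    const size_t  bytes = kBins * sizeof(unsigned int);

    for (int s = 0; s < kNumStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMallocManaged(&hist[s], bytes));
        // Associate the whole allocation (length 0) with exactly one stream.
        CHECK(cudaStreamAttachMemAsync(streams[s], hist[s], 0, cudaMemAttachSingle));
    }

    const int threads = 256;
    const int blocks  = (kPixels + threads - 1) / threads;
    for (int s = 0; s < kNumStreams; ++s) {
        // Clear and fill each buffer in its own stream.
        CHECK(cudaMemsetAsync(hist[s], 0, bytes, streams[s]));
        fillHistogram<<<blocks, threads, 0, streams[s]>>>(hist[s], kPixels, kBins);
        CHECK(cudaGetLastError());
    }

    int mismatches = 0;
    for (int s = 0; s < kNumStreams; ++s) {
        // Wait only for this stream, then read its buffer on the CPU.
        CHECK(cudaStreamSynchronize(streams[s]));
        unsigned int sum = 0, maxCount = 0;
        for (int b = 0; b < kBins; ++b) {
            sum += hist[s][b];
            if (hist[s][b] > maxCount) maxCount = hist[s][b];
        }
        printf("stream %d: max bin count %u, total %u\n", s, maxCount, sum);
        if (sum != static_cast<unsigned int>(kPixels)) ++mismatches;
    }

    for (int s = 0; s < kNumStreams; ++s) {
        CHECK(cudaFree(hist[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return mismatches;
}

int main()
{
    return runDemo() == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Built with nvcc and run under cuda-memcheck and the profiler, a reproducer of this size makes it easier to tell whether the long cudaStreamSynchronize wait comes from the attach pattern itself or from something else in the application.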