cudaDeviceSynchronize needed between kernel launch and cudaMemcpy?

I have code like

myKernel<<<…>>>(srcImg, dstImg)
cudaMemcpy2D(…, cudaMemcpyDeviceToHost)

where the CUDA kernel computes an image ‘dstImg’ (whose buffer lives in GPU memory), and cudaMemcpy2D then copies ‘dstImg’ to an image ‘dstImgCpu’ (whose buffer lives in CPU memory).
Do I have to insert a ‘cudaDeviceSynchronize’ before the ‘cudaMemcpy2D’ in order to ensure that the kernel has finished writing to the ‘dstImg’ GPU memory buffer? From http://stackoverflow.com/questions/11888772/when-to-call-cudadevicesynchronize it looks like I do NOT have to insert a ‘cudaDeviceSynchronize’ after the kernel, but I am not 100% sure.

CUDA operations issued to the same stream always serialize.

Assuming your kernel is issued to the default stream, there is no possibility for a cudaMemcpy2D operation to begin before that kernel has completed.

All global memory updates made by a kernel are guaranteed to be visible to any actor, when the kernel completes.

Assuming all these operations are to the default stream, there is no need for a cudaDeviceSynchronize() after a kernel launch, before a cudaMemcpy (or cudaMemcpy2D) operation.
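To illustrate, here is a minimal sketch (the kernel body, image dimensions, and buffer names are placeholders, not taken from the original post): both operations are issued to the default stream, so the copy cannot begin until the kernel has completed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(const unsigned char *src, unsigned char *dst,
                         size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // trivial placeholder computation: invert the pixel
        dst[y * pitch + x] = 255 - src[y * pitch + x];
    }
}

int main()
{
    const int width = 512, height = 424;
    size_t pitch;
    unsigned char *srcImg, *dstImg;
    cudaMallocPitch(&srcImg, &pitch, width, height);
    cudaMallocPitch(&dstImg, &pitch, width, height);

    unsigned char *dstImgCpu = (unsigned char *)malloc(width * height);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    myKernel<<<grid, block>>>(srcImg, dstImg, pitch, width, height);

    // No cudaDeviceSynchronize() needed here: the copy is issued to the
    // same (default) stream as the kernel, so it waits for the kernel,
    // and cudaMemcpy2D additionally blocks the host until the copy completes.
    cudaMemcpy2D(dstImgCpu, width, dstImg, pitch, width, height,
                 cudaMemcpyDeviceToHost);

    printf("first pixel: %d\n", dstImgCpu[0]);
    free(dstImgCpu);
    cudaFree(srcImg);
    cudaFree(dstImg);
    return 0;
}
```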

Thanks for the info. What if ‘myKernel’ is not launched on the default stream (which has special behaviour)?

As long as kernel and cudaMemcpy() are issued on the same stream (whether default or not), no synchronization is necessary. Operations within the same stream are always executed in order, only operations from different streams may execute in parallel.

But ‘cudaMemcpy’ (and ‘cudaMemcpy2D’) does not have a ‘stream’ parameter, only the ‘async’ variants have …

Correct. So it is not possible to issue them to anything other than the default stream. A cudaMemcpy operation is always issued to the default stream.

Since you didn’t provide anything that looked like actual code, it’s always possible that someone who says “my code is like … cudaMemcpy” really means “I am actually using cudaMemcpyAsync”.

So my answer was a general answer, and some of your subsequent questions are answered by my very first statement:

“CUDA operations issued to the same stream always serialize.”

If, on the other hand, a cuda kernel is launched into one stream and a cudaMemcpyAsync operation is issued into another stream (assuming neither is the default stream with default semantics) then there are no guarantees about the order of execution.
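As a sketch of that cross-stream case (the stream names and dummy kernel are illustrative assumptions, not from the thread): without an explicit dependency the copy in the second stream may start before the kernel in the first stream finishes, so an event is used to enforce the ordering.

```cuda
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f * i;   // placeholder computation
}

int main()
{
    const int n = 1 << 20;
    float *d_buf, *h_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMallocHost(&h_buf, n * sizeof(float));  // pinned memory, required for truly async copies

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t done;
    cudaEventCreate(&done);

    produce<<<(n + 255) / 256, 256, 0, s1>>>(d_buf, n);
    cudaEventRecord(done, s1);           // marks kernel completion in s1

    // Without this wait, the copy in s2 could begin before the kernel in s1
    // has finished -- there is no ordering guarantee between streams.
    cudaStreamWaitEvent(s2, done, 0);
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s2);           // host waits for the copy itself

    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```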

Hi
I’m confused by cudaMemcpy2D. I want to copy a 20×4 matrix from device to host, but the data is not stable; sometimes one row is zero after the copy. My code is like:

void region_search()
{
    dim3 threadsPerBlock(32, 8);
    dim3 Grid((512+threadsPerBlock.x-1)/threadsPerBlock.x, (424+threadsPerBlock.y-1)/threadsPerBlock.y);

    int *d_num, h_num;
    unsigned int *d_output_matrix;
    size_t pitch;
    cudaMalloc((void**)&d_num, sizeof(int));
    cudaMallocPitch((void**)&d_output_matrix, &pitch, 4*sizeof(unsigned int), 20);
    region_search_kernel<<< Grid, threadsPerBlock, sizeof(unsigned char)*(32*8) >>>(d_num, d_output_matrix, pitch);
    cudaMemcpy(&h_num, d_num, sizeof(int), cudaMemcpyDeviceToHost);
    unsigned int h_output1[h_num][4];  // variable-length array (GCC extension in C++)
    cudaMemcpy2D(h_output1, 4*sizeof(unsigned int), d_output_matrix, pitch,
                 sizeof(unsigned int)*4, h_num, cudaMemcpyDeviceToHost);

    for (int x = 0; x < h_num; x++)
    {
      printf("output after copy %d %d %d %d %d\n", x, h_output1[x][0], h_output1[x][1], h_output1[x][2], h_output1[x][3]);
    }
    cudaFree(d_num);
    cudaFree(d_output_matrix);
    // h_num and h_output1 live on the stack; calling free() on them
    // (as the original code did) is undefined behavior, so those calls are removed
}
__global__ void region_search_kernel(int *d_num, unsigned int *output, size_t pitch)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x; // 0..511
    int j = threadIdx.y + blockIdx.y * blockDim.y; // 0..423
    int idex_s = threadIdx.x + threadIdx.y * 32;   // index in shared memory
    int id_block = blockIdx.y * 16 + blockIdx.x;
    __shared__ unsigned int data_x[32*8];
    __shared__ unsigned char data_y[32*8];
    data_x[idex_s] = 100; // initialize the arrays in shared memory
    data_y[idex_s] = 100;
    __syncthreads();
    // some unimportant code elided here
    if (j < 424)
    {
        if (i < 512)
        {
            // code that computes data_x, data_y and num_ elided here
            if (idex_s == 0)
            {
                if (data_x[idex_s] < 100)
                {
                    unsigned int *row = (unsigned int *)((char*)output + (num_[0]-1) * pitch);
                    row[0] = data_x[0] + (blockIdx.x) * (blockDim.x); // minimum x value
                    row[1] = data_x[num_[0]-1] + (blockIdx.x) * (blockDim.x);
                    row[2] = data_y[0] + blockIdx.y * blockDim.y;
                    row[3] = data_y[num_[0]-1] + blockIdx.y * blockDim.y;
                    printf("output in kernel:id block x_min x_max y_min y_max %d %d %d %d %d\n", id_block, row[0], row[1], row[2], row[3]);
                }
            }
        }
    }
}

My problem is: most of the time the printfs match, but sometimes the printf on the host shows a row of 0!
For example (only the last few rows are shown, and the rows may not be in the same order):
Normal output:

output in kernel:id block x_min x_max y_min y_max 503 224 254 253 255
output in kernel:id block x_min x_max y_min y_max 502 223 223 255 255
output in kernel:id block x_min x_max y_min y_max 519 224 255 256 263
output in kernel:id block x_min x_max y_min y_max 518 217 223 256 263
output in kernel:id block x_min x_max y_min y_max 520 256 257 257 263
output in kernel:id block x_min x_max y_min y_max 534 216 223 264 271
output in kernel:id block x_min x_max y_min y_max 535 224 255 264 271
output in kernel:id block x_min x_max y_min y_max 536 256 257 264 271
output in kernel:id block x_min x_max y_min y_max 550 217 223 272 274
output in kernel:id block x_min x_max y_min y_max 551 224 254 272 279
output in kernel:id block x_min x_max y_min y_max 567 228 248 280 287
output after copy 0 223 223 255 255 // corresponds to id_block 502
output after copy 1 224 254 253 255 // corresponds to id_block 503
output after copy 2 217 223 256 263 // corresponds to id_block 518
output after copy 3 224 255 256 263 // corresponds to id_block 519
output after copy 4 256 257 257 263 // corresponds to id_block 520
output after copy 5 216 223 264 271 // corresponds to id_block 534
output after copy 6 224 255 264 271 // corresponds to id_block 535
output after copy 7 256 257 264 271 // corresponds to id_block 536
output after copy 8 217 223 272 274 // corresponds to id_block 550
output after copy 9 224 254 272 279 // corresponds to id_block 551
output after copy 10 228 248 280 287 // corresponds to id_block 567

Abnormal output:

output in kernel:id block x_min x_max y_min y_max 502 223 223 255 255
output in kernel:id block x_min x_max y_min y_max 503 224 254 253 255
output in kernel:id block x_min x_max y_min y_max 520 256 257 257 263
output in kernel:id block x_min x_max y_min y_max 518 217 223 256 263
output in kernel:id block x_min x_max y_min y_max 519 224 255 256 263
output in kernel:id block x_min x_max y_min y_max 535 224 255 264 271
output in kernel:id block x_min x_max y_min y_max 536 256 257 264 269
output in kernel:id block x_min x_max y_min y_max 534 216 223 264 271
output in kernel:id block x_min x_max y_min y_max 551 224 254 272 279
output in kernel:id block x_min x_max y_min y_max 550 217 223 272 272
output in kernel:id block x_min x_max y_min y_max 567 227 249 280 287
output after copy 0 223 223 255 255 // corresponds to id_block 502
output after copy 1 224 254 253 255 // corresponds to id_block 503
output after copy 2 217 223 256 263 // corresponds to id_block 518
output after copy 3 224 255 256 263 // corresponds to id_block 519
output after copy 4 256 257 257 263 // corresponds to id_block 520
output after copy 5 224 255 264 271 // corresponds to id_block 535
output after copy 6 216 223 264 271 // corresponds to id_block 534
output after copy 7 256 257 264 269 // corresponds to id_block 536
output after copy 8 0 0 0 0 // ??? unexpected row of zeros
output after copy 9 217 254 272 279 // corresponds to id_block 550
output after copy 10 227 249 280 287 // corresponds to id_block 567

At first I thought it was a synchronization problem between blocks, but when I added cudaDeviceSynchronize() after the kernel, the problem still occurred.
Then I used a 1D array to copy the data to the host; the problem still occurred.
Finally, I added some printfs at the end of the kernel (which takes some time), and the problem disappeared!
So I think it’s a synchronization problem, but I don’t know why adding cudaDeviceSynchronize() didn’t help. I don’t want to spend more time on this, so: what is the problem, and how can it be resolved?

Try running your code with cuda-memcheck and also with cuda-memcheck --tool racecheck

If there are any problems reported, you’ll need to sort those out.

Hi,
But I don’t know how to use cuda-memcheck on the NVIDIA TX1, since I installed CUDA natively!

OK, after adding /usr/local/cuda/bin to the PATH I can use cuda-gdb and cuda-memcheck!
When I run under cuda-memcheck and cuda-memcheck --tool racecheck, no problems are reported, and the phenomenon I mentioned above disappears. So strange!

If the problem changes when run under cuda-memcheck, it may still be a race condition. For example you may have a race condition in global memory, which cuda-memcheck will not be able to detect. It’s impossible to tell since you’ve chopped out a bunch of your code.

I think it’s safe to say you have a coding defect, possibly a race condition, that you will need to debug.
It’s not clear to me why you posted in this thread or why you believe it may involve host synchronization; for work launched to the same stream (as yours appears to be), no additional synchronization is needed between a kernel call and a subsequent cudaMemcpy operation, as already stated several times in this thread.

Hi, I am doing a device-to-device cudaMemcpy. In my code I have created two threads.
Thread 1 - Copies data from device memory to another device memory buffer.
Thread 2 - Operates on this copied memory.

In the CPU program, how should I know that Thread 1 has completed the memcpy before I instruct the other thread to process the latest data, rather than the previously held data / junk data in the buffer?

As per http://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html, “For transfers from device memory to device memory, no host-side synchronization is performed.” Can you please help me understand how to handle this situation? If you can point me to any reference code, that would be helpful.

Thanks,
Tushar

Issue those operations in the same thread, to the same stream. Even though no host-side synchronization is performed, operations issued to the same stream will always serialize.

Otherwise you’ll need to use inter-thread synchronization, which is a function of your threading system and not CUDA.

Thanks txbob for the reply, but I still don’t understand how I would know that the job has been completed by the first thread before I signal the other thread for further operation. Is there any callback or event I would get after the device-to-device cudaMemcpy completes? Do you know of any such test example (multi-threaded + device-to-device memcpy) that I could refer to?

Issue a cuda event into the same stream where you issued the cuda memcpy, then issue a cuda stream wait event operation. That will indicate completion. Or you could use a stream callback.

If it were me, I would put the operations in the same stream. Simple. Problem solved.

Since your copy is from device memory to device memory, I assume the operation you are waiting to issue is a kernel call. Processing data in host code couldn’t depend on a device->device memory copy.
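A minimal sketch of the event-based approach (the processing kernel and buffer names are illustrative assumptions, not from the thread): the device-to-device copy and an event are issued to one stream, and the waiting host thread blocks on the event before launching the consumer kernel.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // placeholder processing
}

int main()
{
    const int n = 1 << 20;
    float *d_src, *d_dst;
    cudaMalloc(&d_src, n * sizeof(float));
    cudaMalloc(&d_dst, n * sizeof(float));

    cudaStream_t copyStream;
    cudaStreamCreate(&copyStream);
    cudaEvent_t copyDone;
    cudaEventCreate(&copyDone);

    // "Thread 1": issue the device-to-device copy and record an event after it.
    cudaMemcpyAsync(d_dst, d_src, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, copyStream);
    cudaEventRecord(copyDone, copyStream);

    // "Thread 2": block this host thread until the copy has completed,
    // then launch the kernel that consumes d_dst. (With a real threading
    // system, the event handle would be shared between the two threads.)
    cudaEventSynchronize(copyDone);
    process<<<(n + 255) / 256, 256>>>(d_dst, n);
    cudaDeviceSynchronize();

    cudaEventDestroy(copyDone);
    cudaStreamDestroy(copyStream);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

As noted above, the simpler alternative is to issue both the copy and the kernel from one thread into one stream, in which case stream ordering alone guarantees the kernel sees the copied data.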

Thanks txbob for your inputs. It worked with your suggestion. :)