Synchronization for CUBLAS

Hi all,

The CUBLAS documentation mentions that we need synchronization before reading a scalar result:

“Also, the few functions that return a scalar result, such as amax(), amin, asum(), rotg(), rotmg(), dot() and nrm2(), return the resulting value by reference on the host or the device. Notice that even though these functions return immediately, similarly to matrix and vector results, the scalar result is ready only when execution of the routine on the GPU completes. This requires proper synchronization in order to read the result from the host.”

How exactly should we synchronize? Is synchronization needed even if we don’t use streams? I looked for an example in NVIDIA’s CUDA documentation but could not find one.

However, the conjugate gradient sample (http://docs.nvidia.com/cuda/cuda-samples/index.html#conjugategradient) provided by NVIDIA contains the following code:

    while (r1 > tol*tol && k <= max_iter)
    {
        if (k > 1)
        {
            b = r1 / r0;
            cublasStatus = cublasSscal(cublasHandle, N, &b, d_p, 1);
            cublasStatus = cublasSaxpy(cublasHandle, N, &alpha, d_r, 1, d_p, 1);
        }
        else
        {
            cublasStatus = cublasScopy(cublasHandle, N, d_r, 1, d_p, 1);
        }

        cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &alpha, descr, d_val, d_row, d_col, d_p, &beta, d_Ax);
        cublasStatus = cublasSdot(cublasHandle, N, d_p, 1, d_Ax, 1, &dot);
        a = r1 / dot;

        cublasStatus = cublasSaxpy(cublasHandle, N, &a, d_p, 1, d_x, 1);
        na = -a;
        cublasStatus = cublasSaxpy(cublasHandle, N, &na, d_Ax, 1, d_r, 1);

        r0 = r1;
        cublasStatus = cublasSdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
        cudaThreadSynchronize();
        printf("iteration = %3d, residual = %e\n", k, sqrt(r1));
        k++;
    }

There is a cudaThreadSynchronize() call at the end of the loop body. What is it for? Is it there for the cublasSdot calls? There are two cublasSdot calls inside the while loop, so why is there only one cudaThreadSynchronize() call?