Do I need to synchronize the stream / threads after a cusolver call?

Hi NVIDIA team! I have this code:

    #pragma acc host_data use_device(Cov, Corr, CovEigVal, CorrEigVal, U, VT)
    {
      cusolverDnDgesvd(cusolverHandle, 'N', 'N', bands, bands, Cov, bands, CovEigVal, U, bands, VT, bands, work, lwork, rwork, info);

      cusolverDnDgesvd(cusolverHandle, 'N', 'N', bands, bands, Corr, bands, CorrEigVal, U, bands, VT, bands, work, lwork, rwork, info);
    }

Before those calls I’ve also made this:

    #pragma acc host_data use_device(image, Cov)
    {
      cublasDgemm(handle_gemm, CUBLAS_OP_T, CUBLAS_OP_N, bands, bands, N, &alpha, image, N, image, N, &beta, Cov, bands);
    }

    cublasGetStream(handle_gemm, &stream);
    cudaStreamSynchronize(stream);

I believe that last stream synchronization after the dgemm call is necessary, or at least recommended. Is there something similar for the cusolver calls I attached? Do I have to do anything for them to execute “synchronously” (by which I mean that one starts only after the previous one has completed)?

Also, is a single “#pragma acc host_data use_device(…)” region correct here, or do I have to put one around each function call?

Thank you very much!

This question cannot really be answered in a vacuum. By that, I mean that the right answer may depend on what you intend to do after those calls, and also on how the device-accessible memory in question was allocated.

In typical usage, the stream syncs should be unnecessary. Work issued into a given stream is always serialized. That means operation B, issued into stream X, will not begin until operation A, previously issued into stream X, has completed. That is one of the two canonical statements regarding CUDA stream semantics.
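For example, you can attach both library handles to the same stream so that all of the work serializes in issue order with no explicit synchronization between calls. A sketch, reusing the handle and stream names from your question (error checking omitted):

    // Bind both libraries to one stream: every call then waits for the
    // previously issued operation in that stream before it begins.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cublasSetStream(handle_gemm, stream);        // cuBLAS work -> stream
    cusolverDnSetStream(cusolverHandle, stream); // cuSOLVER work -> stream

    // Now: dgemm, then the two gesvd calls, issued back-to-back.
    // The device executes them in order; no cudaStreamSynchronize is
    // needed *between* them (only before the host reads the results).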

There are special cases. For example, if the underlying data is allocated with a managed allocator, it may be accessible in host code and device code simultaneously. In that situation, if host code read the results immediately after the code above, without an intervening synchronization of some sort, it might see “stale” data. There are variations on this theme, for example data allocated in pinned memory. Work issued into other created streams may also break if it depends on these results and you do not synchronize appropriately, because operations in different streams have no ordering guarantees relative to each other. That is essentially the other canonical statement of CUDA stream semantics.
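To illustrate the managed-memory case (a sketch, assuming `CovEigVal` was allocated with `cudaMallocManaged` and the cusolver handle is bound to `stream`):

    // Managed memory lets the host dereference the pointer at any time,
    // so without a sync the host may read results before gesvd finishes.
    cudaStreamSynchronize(stream);   // wait for the device work to complete
    printf("largest singular value: %f\n", CovEigVal[0]);  // now safe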

I’m fairly confident that the OpenACC host_data directive can wrap more-or-less arbitrary sections of code, and is not required to be repeated for each library function call. I suggest asking detailed OpenACC questions on the HPC Compilers forum.
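In other words, something like the following sketch should be fine, reusing the variable names from your question and assuming all of the listed arrays are present in an enclosing OpenACC data region:

    #pragma acc host_data use_device(image, Cov, CovEigVal, U, VT)
    {
      // One host_data region can wrap multiple library calls; inside it,
      // each listed name resolves to the array's device address.
      cublasDgemm(handle_gemm, CUBLAS_OP_T, CUBLAS_OP_N, bands, bands, N,
                  &alpha, image, N, image, N, &beta, Cov, bands);
      cusolverDnDgesvd(cusolverHandle, 'N', 'N', bands, bands, Cov, bands,
                       CovEigVal, U, bands, VT, bands,
                       work, lwork, rwork, info);
    }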

Questions specific to CUBLAS, CUSOLVER, etc. should be asked on the library forum.