Is a cudaStreamSynchronize() needed between kernels and CULA functions when using CUDA streams?

hlei · October 2, 2017, 4:04am

Hi everyone,
I have a problem when using CUDA streams together with CULA functions. The whole pseudocode is as follows:

// Launch configuration shared by all loops (declared once so every loop can see it).
dim3 sumGrid(4, m);
dim3 sumBlock(1024, 1);
int sharedSize = sumBlock.x * sizeof(float);

// Copy each chunk to the device and run the reduction in its own stream.
for (int i = 0; i < nstreams; i++)
{
	checkCudaErrors(cudaMemcpyAsync(dev_X + i*m*n, h_X + i*m*n, m*n * sizeof(float), cudaMemcpyHostToDevice, streams[i]));
	sumReduction_kernel<<<sumGrid, sumBlock, sharedSize, streams[i]>>>(dev_Xmean + i*m, dev_X + i*m*n, m, n);
}
// Subtract the per-row result from each chunk, again in the chunk's stream.
for (int i = 0; i < nstreams; i++)
{
	sub1_kernel<<<sumGrid, sumBlock, 0, streams[i]>>>(dev_XFinal + i*m*n, dev_X + i*m*n, dev_Xmean + i*m, m, n);
}
// Marked in bold in the original post: device-to-host copy of the result.
for (int i = 0; i < nstreams; i++)
{
	checkCudaErrors(cudaMemcpyAsync(h_Xfinal + i*m*n, dev_XFinal + i*m*n, sizeof(float) * m * n, cudaMemcpyDeviceToHost, streams[i]));
}
// CULA calls (arguments omitted in the original post).
for (int i = 0; i < nstreams; i++)
{
	status = culaDeviceSgemm();
	checkStatus(status);

	// Marked in bold in the original post: this is the call that reports the error.
	status = culaDeviceSgetrf();
	checkStatus(status);

	status = culaDeviceSgetri();
	checkStatus(status);

	printf("%s\n", "CULA inverse is done!");
	status = culaDeviceSgemm();
	checkStatus(status);
}

In this pseudocode, if I remove the copy into h_Xfinal (the part marked in bold), culaDeviceSgetrf() fails with the following error:

“CULA Dense : Data error at pos 1 (see the reference Manual for guidance)”

If I allocate h_Xfinal as pinned memory, I get the same error.

What can I use instead of the device-to-host memcpy? I don't need the data on the host here, so the copy is useless and only costs time.

Thank you~

hlei · October 2, 2017, 12:35pm

OK, when I use cudaStreamSynchronize() in place of the cudaMemcpyAsync(), I get the same result, and it also avoids the time spent transferring data between device and host.

So I think a synchronize call is all that is needed.

I don't know if this is definitely right, so take it as just that.
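
For reference, here is a minimal sketch of that pattern. It is not the real program: subtract_kernel, the CHECK macro, and the sizes are hypothetical stand-ins for the kernels above, and the CULA calls are left as a comment because their arguments are not shown. The only thing it relies on is that cudaStreamSynchronize() blocks the host until all work previously queued in that stream has finished.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for sub1_kernel: subtract a constant from each element.
__global__ void subtract_kernel(float *out, const float *in, float value, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) out[idx] = in[idx] - value;
}

int main()
{
    const int nstreams = 4;
    const int count = 1 << 20;              // elements per stream (illustrative)
    const size_t bytes = sizeof(float) * count * nstreams;
    cudaStream_t streams[nstreams];
    float *dev_in = nullptr, *dev_out = nullptr;

    CHECK(cudaMalloc(&dev_in, bytes));
    CHECK(cudaMalloc(&dev_out, bytes));
    CHECK(cudaMemset(dev_in, 0, bytes));

    for (int i = 0; i < nstreams; i++)
        CHECK(cudaStreamCreate(&streams[i]));

    // Queue one kernel per stream; each launch returns to the host immediately.
    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    for (int i = 0; i < nstreams; i++) {
        subtract_kernel<<<blocks, threads, 0, streams[i]>>>(
            dev_out + i * count, dev_in + i * count, 1.0f, count);
        CHECK(cudaGetLastError());
    }

    for (int i = 0; i < nstreams; i++) {
        // Wait until everything queued in streams[i] has finished.
        // This replaces the device-to-host copy; no data leaves the GPU.
        CHECK(cudaStreamSynchronize(streams[i]));

        // The CULA calls for chunk i would go here, reading
        // dev_out + i * count (arguments not shown in this thread).
    }

    // Read back one value only to show that the kernels have completed.
    float first = 0.0f;
    CHECK(cudaMemcpy(&first, dev_out, sizeof(float), cudaMemcpyDeviceToHost));
    printf("first element after sync: %f\n", first);

    for (int i = 0; i < nstreams; i++)
        CHECK(cudaStreamDestroy(streams[i]));
    CHECK(cudaFree(dev_in));
    CHECK(cudaFree(dev_out));
    return 0;
}
```

A single cudaDeviceSynchronize() after the launch loop would also wait for all the streams at once. Synchronizing per stream, as above, lets the library work on chunk i start while later streams may still be running.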

Thank you for paying attention to this…
