CUDA 11.4:

Hello,

In the following simple kernel, each column (i) is subtracted from column (i+1).
Then the squared magnitude (ABS2) of the result is written to the destination.
The code works fine on a matrix with 12800 rows and 40 columns.
I checked the data at the start and at the end of the destination. All data is correct.
Should I call cudaDeviceSynchronize() after such a kernel?

I measured the elapsed time with and without cudaDeviceSynchronize().
With cudaDeviceSynchronize(), the time is much longer.

/**************************************************************************************/
__global__ void mat_col_sub_kernel (ComplexInt32 *pSrc, uint32_t *pDest, int Cols, int Rows)
{
	uint32_t RowId = threadIdx.x + blockIdx.x * blockDim.x;
	uint32_t ColId = threadIdx.y + blockIdx.y * blockDim.y;

	// Skip column 0: each output column is the difference of two adjacent input columns.
	if (ColId > 0 && ColId < Cols && RowId < Rows)
	{
		// Source index of the current element, and destination index in the
		// (Cols-1)-wide output matrix.
		unsigned int idx   = RowId * Cols + ColId;
		unsigned int idOut = RowId * (Cols - 1) + ColId - 1;

		// Complex difference; unsigned arithmetic wraps modulo 2^32, which still
		// yields the correct squared magnitude as long as it fits in 32 bits.
		uint32_t Re = pSrc[idx].Re - pSrc[idx - 1].Re;
		uint32_t Im = pSrc[idx].Im - pSrc[idx - 1].Im;
		pDest[idOut] = Re * Re + Im * Im;
	}
}

/**************************************************************************************/
void mat_col_sub (ComplexInt32 *pSrc, uint32_t *pDest, int Cols, int Rows)
{
	dim3 dimBlock(DIMX, DIMY);
	dim3 dimGrid;
	dimGrid.x = (Rows + dimBlock.x - 1) / dimBlock.x;
	dimGrid.y = (Cols + dimBlock.y - 1) / dimBlock.y;

	clock_gettime (CLOCK_REALTIME, &Before);
	mat_col_sub_kernel <<<dimGrid, dimBlock>>> (pSrc, pDest, Cols, Rows);
	clock_gettime (CLOCK_REALTIME, &After);
}

Thank you,
Zvika

What do you mean by “much longer”? A kernel launch is asynchronous: if you measure the time on the host without cudaDeviceSynchronize(), you probably just measure the kernel launch overhead, not the kernel execution time.
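For reference, here is a minimal sketch of timing the kernel on the GPU itself with CUDA events instead of host-side clock_gettime (assuming the kernel and launch configuration from your post):

```cuda
// Sketch: timing an asynchronous kernel launch with CUDA events.
// cudaEventElapsedTime reports GPU execution time, not just launch overhead.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mat_col_sub_kernel <<<dimGrid, dimBlock>>> (pSrc, pDest, Cols, Rows);
cudaEventRecord(stop);

cudaEventSynchronize(stop);   // block the host until the stop event has occurred
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Measured this way, the number should match what you see with cudaDeviceSynchronize() between launch and host timer.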

Hi striker159, All,

It’s quite strange.
I called:
cudaDeviceSynchronize()
cudaMemcpy (…, cudaMemcpyDeviceToHost)
and then wrote the data to a reference file.

Then, after a power off/on, I read the file back and ran the kernel without cudaDeviceSynchronize().
After cudaMemcpy() I compared the data to the reference file.
They are identical.
Does that make sense?

Thank you,
Zvika

Yes. cudaMemcpy(..., cudaMemcpyDeviceToHost) uses default-stream semantics: the call automatically waits for any preceding kernels to finish before the copy is executed. You do not need to explicitly call cudaDeviceSynchronize() beforehand.
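In other words, a launch-plus-copy sequence like this sketch is already correctly ordered (the host buffer hDest is a hypothetical name, not from your code):

```cuda
// Sketch: no explicit cudaDeviceSynchronize() is needed here.
// The synchronous cudaMemcpy on the default stream waits for the
// preceding kernel to complete before copying, and blocks the host.
mat_col_sub_kernel <<<dimGrid, dimBlock>>> (pSrc, pDest, Cols, Rows);

cudaMemcpy(hDest, pDest, (size_t)Rows * (Cols - 1) * sizeof(uint32_t),
           cudaMemcpyDeviceToHost);
// hDest now holds the finished kernel output.
```

That is why your comparison against the reference file succeeds even without the explicit synchronize.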

Hi striker159,

Thank you very much !

Best regards,
Zvika