CUDA 11.4:

Hello,

In the following simple kernel, each column (i) is subtracted from column (i+1).
Then the ABS2 (squared magnitude) of the result is written to the destination.
The code works fine on a matrix with 12800 rows and 40 columns.
I checked the data at the start and at the end of the destination. All of it is correct.
Should I call cudaDeviceSynchronize() after such a kernel?

I checked the elapsed time with and without cudaDeviceSynchronize().
With cudaDeviceSynchronize(), the time is much longer.

/**************************************************************************************/
__global__ void mat_col_sub_kernel (ComplexInt32 *pSrc, uint32_t *pDest, int Cols, int Rows)
{
	uint32_t RowId = threadIdx.x + blockIdx.x * blockDim.x;
	uint32_t ColId = threadIdx.y + blockIdx.y * blockDim.y;

	// Each thread handles one element of columns 1..Cols-1 and writes
	// |col(i) - col(i-1)|^2 into the (Cols-1)-wide destination matrix.
	if (ColId>0 && ColId<Cols && RowId<Rows)
	{
		unsigned int idx;
		unsigned int idOut;
		uint32_t Re,Im;

		idx = RowId*Cols + ColId;
		idOut = RowId*(Cols-1) + ColId-1;
		Re = pSrc[idx].Re - pSrc[idx-1].Re;
		Im = pSrc[idx].Im - pSrc[idx-1].Im;
		pDest[idOut] = Re*Re+Im*Im;
	}
}

/**************************************************************************************/
void mat_col_sub (ComplexInt32 *pSrc, uint32_t *pDest, int Cols, int Rows)
{
	dim3 dimBlock(DIMX, DIMY);
	dim3 dimGrid;
	dimGrid.x = (Rows+ dimBlock.x - 1) / dimBlock.x;
	dimGrid.y = (Cols+ dimBlock.y - 1) / dimBlock.y;

	clock_gettime (CLOCK_REALTIME, &Before);
	mat_col_sub_kernel <<<dimGrid, dimBlock>>> (pSrc, pDest, Cols, Rows);
	clock_gettime (CLOCK_REALTIME, &After);
}

Thank you,
Zvika

What do you mean by “much longer”? If you measure the time on the host without cudaDeviceSynchronize(), you probably just measure the kernel launch overhead, not the kernel execution time. The launch is asynchronous: the host call returns immediately while the kernel is still running on the GPU.
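
For reference, a minimal sketch of host-side timing that includes the kernel execution (reusing mat_col_sub_kernel, dimGrid and dimBlock from your snippet; error checking omitted):

	struct timespec Before, After;

	clock_gettime (CLOCK_REALTIME, &Before);
	mat_col_sub_kernel <<<dimGrid, dimBlock>>> (pSrc, pDest, Cols, Rows);
	cudaDeviceSynchronize ();	// block until the kernel has actually finished
	clock_gettime (CLOCK_REALTIME, &After);
	// After - Before now covers launch overhead plus kernel execution time

Alternatively, cudaEvent-based timing (cudaEventRecord / cudaEventElapsedTime) measures the GPU-side time without stalling the whole device.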

Hi striker159, All,

It’s quite strange.
I called:
cudaDeviceSynchronize()
cudaMemcpy (…, cudaMemcpyDeviceToHost)
and then wrote the data to a reference file.

Then, after a power off/on, I read that file back and ran the kernel without cudaDeviceSynchronize().
After cudaMemcpy() I compared the data to the reference file.
They are identical.
Does that make sense?

Thank you,
Zvika

cudaMemcpy(..., cudaMemcpyDeviceToHost) uses default-stream semantics. The call will automatically wait for any preceding kernels to finish before the copy is executed. You do not need to explicitly call cudaDeviceSynchronize() beforehand.
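
In other words, a sequence like the following sketch (assuming, in addition to the names from your snippet, a host buffer h_Dest with room for Rows*(Cols-1) results) is already safe without an explicit synchronization:

	mat_col_sub_kernel <<<dimGrid, dimBlock>>> (pSrc, pDest, Cols, Rows);
	// The blocking copy waits for the kernel on the default stream to finish
	// before transferring, so h_Dest contains the final results.
	cudaMemcpy (h_Dest, pDest, (size_t)Rows * (Cols-1) * sizeof(uint32_t), cudaMemcpyDeviceToHost);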


Hi striker159,

Thank you very much !

Best regards,
Zvika
