matrixMul SDK sample. Where is cudaThreadSynchronize?

Hello, in the SDK matrixMul sample (with shared memory) we can see this code:

// Invoke kernel

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);

MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// Read C from device memory

cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

The kernel launch is asynchronous, so it seems that we need to call cudaThreadSynchronize() before we use the output matrix produced by the kernel.

Why isn't there any cudaThreadSynchronize() in the matrixMul sample?

Does cudaMemcpy synchronize automatically?

Thank you

Yes, there is an implicit synchronization when calling cudaMemcpy: in the default stream it blocks the host until all previously issued work (including your kernel) has finished, and only then performs the copy.
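If the kernel is not followed by a blocking cudaMemcpy, for example when you only want to check whether it ran correctly, you can synchronize explicitly. A minimal sketch (not from the SDK sample; note that on current toolkits cudaThreadSynchronize() is deprecated in favor of the equivalent cudaDeviceSynchronize()):

	MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

	// The launch returns immediately; the kernel runs asynchronously.
	// cudaThreadSynchronize() blocks the host until the device is idle
	// and returns any error raised by previously launched work.
	cudaError_t err = cudaThreadSynchronize();
	if (err != cudaSuccess)
		printf("Kernel failed: %s\n", cudaGetErrorString(err));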

Thank you,

And if I want to measure only the kernel execution time, should I write code like this?

// copy host memory to device

   ...

   // allocate device memory for result

   ...

   // allocate host memory for the result

   ... 

	// setup execution parameters

	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);

	dim3 grid(WC / threads.x, HC / threads.y);

	cudaEvent_t startK,stopK;

	cudaEventCreate(&startK); cudaEventCreate(&stopK);

	cudaEventRecord(startK,0);

	// execute the kernel

	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

	cudaEventRecord(stopK,0);

	cudaEventSynchronize(stopK);

	float elapsedTimeK;

	cudaEventElapsedTime(&elapsedTimeK,startK,stopK);

	printf("Processing time: %f (ms) \n", elapsedTimeK);

	// check if kernel execution generated an error

	cutilCheckMsg("Kernel execution failed");

	// copy result from device to host

	cutilSafeCall(cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost) );

I use Tesla C1060 device and have the result:

Size of matrix: 16384 x 16384, Block size = 16

Memory preparation: 726.160645 (ms)

Processing time: 44498.328125 (ms)

Copying result: 463.738159 (ms)

Could somebody please tell me, is this a normal time (45 seconds) for matrixMul?

You can try CUBLAS; it takes only 23.5 seconds.
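A sketch of what the CUBLAS call could look like, using the legacy CUBLAS API that shipped with that SDK generation (the function and variable names here are my assumptions, not code from the thread):

	#include <cublas.h>
	#include <cstdio>

	// Hypothetical helper: C = A * B for n x n matrices already on the device.
	void sgemm_on_device(int n, const float *d_A, const float *d_B, float *d_C)
	{
		cublasInit();

		// C = 1.0 * A * B + 0.0 * C. CUBLAS assumes column-major storage,
		// so with row-major C arrays you effectively compute the transpose
		// unless you swap the A and B arguments.
		cublasSgemm('n', 'n', n, n, n,
		            1.0f, d_A, n, d_B, n,
		            0.0f, d_C, n);

		if (cublasGetError() != CUBLAS_STATUS_SUCCESS)
			printf("cublasSgemm failed\n");

		cublasShutdown();
	}

Like a kernel launch, cublasSgemm is asynchronous, so the same timing pattern with cudaEventRecord/cudaEventSynchronize applies.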

The SDK example is not fast; please read Volkov's paper:

Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA, 2008, IEEE Press. http://forums.nvidia.com/index.php?showtopic=89084