cudaThreadSynchronize

I am trying to understand the difference between cudaThreadSynchronize() and the point at which a <<< >>> kernel launch finishes executing.

My impression so far is that a <<< >>> launch finishes when all the threads it launched are finished. If that is the case, how is cudaThreadSynchronize() different?

http://developer.download.nvidia.com/compu…eaa8483576.html

A kernel launch is asynchronous; you need to issue cudaThreadSynchronize() when you want to use the output of the kernel.
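To make the asynchrony concrete, here is a minimal sketch (variable names `d_data`, `num_blocks`, `num_threads` are placeholders, not from the thread):

```cpp
// The <<< >>> statement only *queues* the kernel; it returns to the host
// immediately, long before the GPU threads have finished.
kernel<<<num_blocks, num_threads>>>(d_data);

// Host code keeps running here while the GPU is still working.

cudaThreadSynchronize();  // block the host until the kernel has finished
// (in newer CUDA releases, cudaDeviceSynchronize() replaces this call)

// Only now is it safe to read the kernel's output, e.g. via cudaMemcpy.
```

So <<< >>> "finishing" only means the launch was enqueued; cudaThreadSynchronize() is what actually waits for the threads to complete.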

So here's my issue: if I launch one thread it takes time x, and if I use 10 threads the time scales linearly to 10x, when I am expecting 10 threads not to scale linearly.

FYI … In both cases I am launching 1 thread per block.

For example, the Tesla C1060 has 30 SMs (streaming multiprocessors).

If you issue 10 thread blocks (one thread per block) to execute the kernel code, then the 10 thread blocks will be distributed across 10 SMs. The time is not 10x because the 10 SMs execute simultaneously.

So I did read up on SMs and need the following advice from you:

  1. What config would execute simultaneously? 1 block with 10 threads?

  2. While you are at it, can you also come up with a config that would allow for 200 simultaneous executions?

Both questions are for the Tesla C1060.

Physically, only 30 SMs can execute “simultaneously”.

However, this is not important: from the programmer’s view, all threads execute simultaneously.
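To make the two configs concrete, here is a sketch of launch configurations (assuming the C1060's 30 SMs; the 200-thread config is one of several valid choices, and `VecAdd` is used as the example kernel):

```cpp
// Question 1: 10 threads, one per block. The hardware scheduler
// distributes the 10 blocks across 10 different SMs, so each thread
// runs on a separate SM.
VecAdd<<<10, 1>>>(A, B, C, N);

// Question 2: ~200 concurrently resident threads. For example,
// 25 blocks x 8 threads = 200 threads; 30 blocks x 8 threads = 240
// would put one block on every SM of a C1060.
VecAdd<<<25, 8>>>(A, B, C, N);
```

Note that "simultaneous" here means the threads are resident on the SMs at the same time; the scheduler interleaves their warps, which is why, from the programmer's view, they all run at once.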

So can you please give a sample grid and block config for the 10 threads so each gets assigned to a separate SM? I know this is trivial but I still want to make sure.

The following code is addition of two vectors.

you can configure

(1) num_thread_per_block: number of threads per block

(2) num_blocks : number of blocks

Take a large N, for example N = 100 mega, and try

num_blocks = 1, 2, 4, 8, 16, 30 and see what happens.

// C = A + B
// A, B, C are 1-D vectors of size N
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	// grid-stride loop: each thread handles every
	// (gridDim.x * blockDim.x)-th element
	for (; i < N; i += gridDim.x * blockDim.x) {
		C[i] = A[i] + B[i];
	}
}

int main()
{
	const int N = 100 * 1024 * 1024;  // "100 mega" elements
	float *A, *B, *C;
	cudaMalloc(&A, N * sizeof(float));
	cudaMalloc(&B, N * sizeof(float));
	cudaMalloc(&C, N * sizeof(float));

	// Kernel invocation
	int num_blocks = 1;
	int num_thread_per_block = 64;
	VecAdd<<<num_blocks, num_thread_per_block>>>(A, B, C, N);

	cudaThreadSynchronize();  // wait for the kernel to finish

	cudaFree(A);
	cudaFree(B);
	cudaFree(C);
	return 0;
}
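The suggested experiment (num_blocks = 1, 2, 4, 8, 16, 30) can be sketched as a host-side timing loop using CUDA events, assuming the arrays A, B, C of size N and num_thread_per_block are already set up as above:

```cpp
// Sweep the block count and time each launch with CUDA events.
int configs[] = {1, 2, 4, 8, 16, 30};
for (int k = 0; k < 6; ++k) {
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	VecAdd<<<configs[k], num_thread_per_block>>>(A, B, C, N);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);  // wait until the kernel has finished

	float ms;
	cudaEventElapsedTime(&ms, start, stop);
	printf("num_blocks = %2d: %.2f ms\n", configs[k], ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}
```

On a C1060 you should see the time drop as num_blocks grows toward 30, since each additional block lands on an otherwise idle SM.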