I am trying to understand the difference between cudaThreadSynchronize() and the point at which a kernel launched with the <<< >>> syntax finishes executing.

My impression so far is that the <<< >>> statement finishes executing when all the threads it launched have finished. If that is the case, how is cudaThreadSynchronize() different?

A kernel launch is asynchronous; you need to issue cudaThreadSynchronize() when you want to use the output of the kernel.
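To make the asynchrony visible, here is a minimal sketch (the kernel name and sizes are made up for illustration): the launch returns to the host immediately, and the host must synchronize before it can safely read the results.

```cuda
// Sketch: a kernel launch returns control to the host immediately;
// the host must synchronize before reading the kernel's output.
__global__ void Work(float* data)
{
	data[threadIdx.x] *= 2.0f;
}

int main()
{
	float* d_data;
	cudaMalloc(&d_data, 64 * sizeof(float));

	Work<<<1, 64>>>(d_data);   // returns to the host right away
	// ... host code here runs concurrently with the kernel ...
	cudaThreadSynchronize();   // blocks until the kernel has finished
	// now it is safe to cudaMemcpy the results back to the host

	cudaFree(d_data);
	return 0;
}
```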

So here is my issue: if I launch one thread it takes time x, but if I launch 10 threads the time scales linearly to 10x, when I was expecting 10 threads not to scale linearly.

FYI … In both cases I am launching 1 thread per block.

For example, the Tesla C1060 has 30 SMs (streaming multiprocessors). If you issue 10 thread blocks (one thread per block) to execute the kernel, those 10 blocks are distributed across 10 SMs, so the time is not 10x, because the 10 SMs execute simultaneously.
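The 10-blocks-of-one-thread launch described above would look like this (the kernel name and arguments are placeholders):

```cuda
// 10 blocks, 1 thread per block: the hardware can schedule each
// block on a different SM, so the blocks run concurrently rather
// than one after another.
SomeKernel<<<10, 1>>>(d_in, d_out);
cudaThreadSynchronize();   // wait for all 10 blocks to finish
```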

So I did read up on SMs, and I need the following advice from you:

  1. What config would execute simultaneously? 1 block with 10 threads?

  2. While you are at it, can you also come up with a config which would allow for 200 simultaneous executions?

Both questions are for the Tesla C1060 .

Physically, only 30 SMs can execute “simultaneously”.

However, this is not important: from the programmer’s view, all threads are executed simultaneously.
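As a rough sketch of configs for the two questions (these numbers are my own reading of “simultaneous” — one block per SM, and 8 scalar cores per SM on the C1060; the kernel name is a placeholder):

```cuda
// Q1: 10 threads running at the same time, one per SM:
SomeKernel<<<10, 1>>>(d_in, d_out);   // 10 blocks x 1 thread

// Q2: 200 concurrent threads, e.g. 25 blocks of 8 threads each;
// 25 blocks fit on the 30 SMs at once, 8 threads match the 8
// scalar cores per SM on the C1060:
SomeKernel<<<25, 8>>>(d_in, d_out);
```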

So can you please give a sample grid and block config for the 10 threads so each gets assigned to a separate SM? I know this is trivial but I still want to make sure.

The following code adds two vectors. You can configure:

(1) num_thread_per_block: number of threads per block

(2) num_blocks: number of blocks

Take a large N, for example N = 100 Mega, try

num_blocks = 1, 2, 4, 8, 16, 30 and see what happens.

// C = A + B
// A, B, C are 1-D vectors of size N
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	// grid-stride loop: each thread handles several elements
	// when N is larger than the total number of threads
	for (; i < N; i += gridDim.x * blockDim.x) {
		C[i] = A[i] + B[i];
	}
}

int main()
{
	int N = 100 * 1024 * 1024;   // "100 Mega" elements
	float *A, *B, *C;            // device pointers
	cudaMalloc(&A, N * sizeof(float));
	cudaMalloc(&B, N * sizeof(float));
	cudaMalloc(&C, N * sizeof(float));

	// Kernel invocation
	int num_blocks = 1;
	int num_thread_per_block = 64;
	VecAdd<<<num_blocks, num_thread_per_block>>>(A, B, C, N);
	cudaThreadSynchronize();     // wait for the kernel to finish

	cudaFree(A); cudaFree(B); cudaFree(C);
	return 0;
}
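To actually measure how the time changes with num_blocks, one way is to bracket the launch with CUDA events (a sketch; assumes A, B, C, N and the launch parameters are set up as above):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
VecAdd<<<num_blocks, num_thread_per_block>>>(A, B, C, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);        // wait until the kernel is done

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("num_blocks=%d: %.3f ms\n", num_blocks, ms);
```

Without the cudaEventSynchronize (or a cudaThreadSynchronize) before reading the timer, you would only be timing the launch overhead, which is the original confusion in this thread.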