cudaThreadSynchronize

I am trying to understand the difference between cudaThreadSynchronize() and the point at which a <<< >>> kernel launch finishes executing.

My impression so far is that a <<< >>> launch finishes when all the threads it launched are finished. If that is the case, how is cudaThreadSynchronize() different?

http://developer.download.nvidia.com/compu…eaa8483576.html

A kernel launch is asynchronous; you need to issue cudaThreadSynchronize() when you want to use the output of the kernel.
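To make the asynchrony concrete, here is a minimal sketch (variable names `d_data`, `num_blocks`, `num_threads` are placeholders, not from the thread):

```cpp
// The <<< >>> statement only *queues* the kernel; it returns to the host
// immediately, long before the GPU threads have finished.
kernel<<<num_blocks, num_threads>>>(d_data);

// Host code keeps running here while the GPU is still working.

cudaThreadSynchronize();  // block the host until the kernel has finished
// (in newer CUDA releases, cudaDeviceSynchronize() replaces this call)

// Only now is it safe to read the kernel's output, e.g. via cudaMemcpy.
```

So <<< >>> "finishing" only means the launch was enqueued; cudaThreadSynchronize() is what actually waits for the threads to complete.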

So here's my issue: if I launch one thread it takes time x, and if I use 10 threads the time scales linearly to 10x, when I am expecting 10 threads not to scale linearly.

FYI … In both cases I am launching 1 thread per block.

For example, the Tesla C1060 has 30 SMs (streaming multiprocessors).

If you issue 10 thread blocks (one thread per block) to execute the kernel code, then the 10 thread blocks will be distributed across 10 SMs. The time is not 10x because the 10 SMs execute simultaneously.

So I did read up on SMs and need the following advice from you:

  1. What config would execute simultaneously? 1 block with 10 threads?

  2. While you are at it, can you also come up with a config that would allow for 200 simultaneous executions?

Both questions are for the Tesla C1060.

Physically, only 30 SMs can execute “simultaneously”.

However, this is not important: from the programmer’s view, all threads execute simultaneously.
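To make the two configs concrete, here is a sketch of launch configurations (assuming the C1060's 30 SMs; the 200-thread config is one of several valid choices, and `VecAdd` is used as the example kernel):

```cpp
// Question 1: 10 threads, one per block. The hardware scheduler
// distributes the 10 blocks across 10 different SMs, so each thread
// runs on a separate SM.
VecAdd<<<10, 1>>>(A, B, C, N);

// Question 2: ~200 concurrently resident threads. For example,
// 25 blocks x 8 threads = 200 threads; 30 blocks x 8 threads = 240
// would put one block on every SM of a C1060.
VecAdd<<<25, 8>>>(A, B, C, N);
```

Note that "simultaneous" here means the threads are resident on the SMs at the same time; the scheduler interleaves their warps, which is why, from the programmer's view, they all run at once.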

So can you please give a sample grid and block config for the 10 threads so each gets assigned to a separate SM? I know this is trivial but I still want to make sure.

The following code is addition of two vectors.

you can configure

(1) num_thread_per_block: number of threads per block

(2) num_blocks : number of blocks

Take a large N, for example N = 100 mega, and try

num_blocks = 1, 2, 4, 8, 16, 30 and see what happens.

// C = A + B
// A, B, C are 1-D vectors of size N
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	// grid-stride loop: each thread handles every
	// (gridDim.x * blockDim.x)-th element
	for (; i < N; i += gridDim.x * blockDim.x) {
		C[i] = A[i] + B[i];
	}
}

int main()
{
	const int N = 100 * 1024 * 1024;  // "100 mega" elements
	float *A, *B, *C;
	cudaMalloc(&A, N * sizeof(float));
	cudaMalloc(&B, N * sizeof(float));
	cudaMalloc(&C, N * sizeof(float));

	// Kernel invocation
	int num_blocks = 1;
	int num_thread_per_block = 64;
	VecAdd<<<num_blocks, num_thread_per_block>>>(A, B, C, N);

	cudaThreadSynchronize();  // wait for the kernel to finish

	cudaFree(A);
	cudaFree(B);
	cudaFree(C);
	return 0;
}
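The suggested experiment (num_blocks = 1, 2, 4, 8, 16, 30) can be sketched as a host-side timing loop using CUDA events, assuming the arrays A, B, C of size N and num_thread_per_block are already set up as above:

```cpp
// Sweep the block count and time each launch with CUDA events.
int configs[] = {1, 2, 4, 8, 16, 30};
for (int k = 0; k < 6; ++k) {
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	VecAdd<<<configs[k], num_thread_per_block>>>(A, B, C, N);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);  // wait until the kernel has finished

	float ms;
	cudaEventElapsedTime(&ms, start, stop);
	printf("num_blocks = %2d: %.2f ms\n", configs[k], ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}
```

On a C1060 you should see the time drop as num_blocks grows toward 30, since each additional block lands on an otherwise idle SM.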