I am trying to understand the difference between cudaThreadSynchronize() and the point at which the angle-bracket launch statement (<<< >>>) finishes executing.
My impression so far is that the <<< >>> statement finishes executing when all the threads it launched have finished. If that is the case, how is cudaThreadSynchronize() different?
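For what it's worth, a kernel launch is asynchronous with respect to the host: the <<< >>> statement returns once the launch has been queued, not when the threads have finished. The sketch below illustrates that by reading a host timer right after the launch and again after synchronizing. The kernel busy(), the helper time_launch(), and the iteration count are made up for illustration, and it uses cudaDeviceSynchronize(), the newer name for the now-deprecated cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a long dependent loop so the GPU stays busy for a while.
__global__ void busy(float* out, int iters)
{
    float x = 0.0f;
    for (int i = 0; i < iters; ++i) x = x * 0.999f + 1.0f;
    *out = x;  // store the result so the loop is not optimized away
}

// Measures host time elapsed right after the launch and after synchronizing.
void time_launch(double* launch_ms, double* total_ms)
{
    using clk = std::chrono::steady_clock;
    float* d_out;
    cudaMalloc(&d_out, sizeof(float));   // also creates the CUDA context up front

    auto t0 = clk::now();
    busy<<<1, 1>>>(d_out, 50000000);     // returns once the launch is queued
    auto t1 = clk::now();
    cudaDeviceSynchronize();             // blocks until the kernel has finished
    auto t2 = clk::now();

    *launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    *total_ms  = std::chrono::duration<double, std::milli>(t2 - t0).count();
    cudaFree(d_out);
}

int main()
{
    double launch_ms = 0.0, total_ms = 0.0;
    time_launch(&launch_ms, &total_ms);
    printf("after launch: %.3f ms, after synchronize: %.3f ms\n", launch_ms, total_ms);
    return 0;
}
```

If the first number is much smaller than the second, the host timer was stopped before the kernel finished, which is the usual reason for needing the synchronize call before reading a timer.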
So here is my issue: if I launch one thread, it takes time x, and if I use 10 threads, the time scales linearly to 10x. I was expecting the time for 10 threads not to scale linearly.
FYI … In both cases I am launching 1 thread per block.
So can you please give a sample grid and block configuration for the 10 threads so that each one gets assigned to a separate SM? I know this is trivial, but I still want to make sure.
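A minimal sketch of such a configuration, assuming 10 blocks of 1 thread each. Note that the launch configuration cannot pin a block to a particular SM; the hardware scheduler decides placement, so the code only reports where each block ended up by reading the PTX %smid register. The names get_smid(), record_sm(), and sm_of_block are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads the ID of the multiprocessor the calling thread runs on (PTX %smid).
__device__ unsigned int get_smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical kernel: each block records which SM it was scheduled on.
__global__ void record_sm(unsigned int* sm_of_block)
{
    if (threadIdx.x == 0) sm_of_block[blockIdx.x] = get_smid();
}

int main()
{
    const int num_blocks = 10;            // 10 blocks ...
    const int num_thread_per_block = 1;   // ... of 1 thread each = 10 threads
    unsigned int h_sm[num_blocks];
    unsigned int* d_sm;
    cudaMalloc(&d_sm, num_blocks * sizeof(unsigned int));

    record_sm<<<num_blocks, num_thread_per_block>>>(d_sm);
    // cudaMemcpy on the default stream waits for the kernel before copying.
    cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);

    for (int b = 0; b < num_blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_sm[b]);

    cudaFree(d_sm);
    return 0;
}
```

On a GPU with at least 10 idle SMs the blocks will usually land on different SMs, but that is scheduler behavior rather than something the launch guarantees.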
(1) num_thread_per_block: number of threads per block
(2) num_blocks : number of blocks
Take a large N, for example N = 100 M elements, then try num_blocks = 1, 2, 4, 8, 16, 30 and see how the run time changes.
// C = A + B
// A, B, C are 1-D vector with size N
__global__ void VecAdd(float* A, float* B, float* C, int N )
{
    // Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for(; i < N; i += gridDim.x * blockDim.x ){
        C[i] = A[i] + B[i];
    }
}
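int main()
{
    const int N = 100 * 1000 * 1000;   // number of elements
    float *A, *B, *C;                  // device vectors (left uninitialized; only timing matters here)
    cudaMalloc(&A, N * sizeof(float));
    cudaMalloc(&B, N * sizeof(float));
    cudaMalloc(&C, N * sizeof(float));
    // Kernel invocation
    int num_blocks = 1;
    int num_thread_per_block = 64;
    VecAdd<<<num_blocks, num_thread_per_block>>>(A, B, C, N);
    cudaThreadSynchronize();           // wait for the kernel to finish
    cudaFree(A); cudaFree(B); cudaFree(C);
}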
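The experiment suggested above could be automated roughly as follows. This is only a sketch under a few assumptions: N is taken as 100 million floats (about 1.2 GB of device memory for the three vectors, so reduce it on smaller cards), the helper time_vecadd() is hypothetical, and timing is done with CUDA events so the measurement covers the kernel itself rather than just the launch call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// C = A + B, grid-stride loop so any grid size covers all N elements.
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < N; i += gridDim.x * blockDim.x) C[i] = A[i] + B[i];
}

// Hypothetical helper: returns the time of one VecAdd launch in milliseconds.
float time_vecadd(const float* A, const float* B, float* C, int N,
                  int num_blocks, int num_thread_per_block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    VecAdd<<<num_blocks, num_thread_per_block>>>(A, B, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel and stop event are done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int N = 100 * 1000 * 1000;          // assumed meaning of "100 M"
    const size_t bytes = N * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);                  // defined inputs; values do not affect timing
    cudaMemset(B, 0, bytes);

    const int block_counts[] = {1, 2, 4, 8, 16, 30};
    for (int k = 0; k < 6; ++k) {
        float ms = time_vecadd(A, B, C, N, block_counts[k], 64);
        printf("num_blocks = %2d: %.3f ms\n", block_counts[k], ms);
    }

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

The first measurement may include some one-time startup cost, so running the sweep twice and looking at the second pass gives a cleaner comparison.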