I have a kernel that looks something like this:
__global__ void test_timer(float *C, bench* BM)
{
    clock_t beginning = clock();
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int a = 0;
    int i;
    for (i = 0; i < 10000; i++)
    {
        a = a + i;
    }
    clock_t ending = clock();
    C[tid] = a;
    BM[tid].start = beginning;
    BM[tid].end = ending;
}
I would expect all threads to start and end at the same time, since the multiprocessor (MP) is SIMD hardware. However, this is not the case.
Running with 1 thread:
tid | start time | end time | duration
0 | 2793598 | 3594136 | 800538
Running with 2 threads:
tid | start time | end time | duration
0 | 2893796 | 3694320 | 800524
1 | 2893788 | 3694328 | 800540
Running with 4 threads:
tid | start time | end time | duration
0 | 2807946 | 3608452 | 800506
1 | 2807938 | 3608460 | 800522
2 | 2807932 | 3608472 | 800540
3 | 2807924 | 3608480 | 800556
Running with 8 threads:
tid | start time | end time | duration
0 | 2816142 | 3616628 | 800486
1 | 2816134 | 3616636 | 800502
2 | 2816090 | 3616672 | 800582
3 | 2816082 | 3616680 | 800598
4 | 2816110 | 3616616 | 800506
5 | 2816102 | 3616624 | 800522
6 | 2816134 | 3616660 | 800526
7 | 2816126 | 3616668 | 800542
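To put a number on the discrepancy, here is a minimal host-side sketch that computes how far apart the recorded start and end ticks are across threads. It is not part of the original program: the layout of `bench` is assumed from the kernel's use of `start` and `end`, and the helper names `start_spread` and `end_spread` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ctime>

// Assumed layout of the per-thread record; the real definition is not shown.
struct bench {
    clock_t start;
    clock_t end;
};

// Hypothetical helper: latest start tick minus earliest start tick.
long start_spread(const bench* bm, std::size_t n) {
    assert(n > 0);  // at least one record is required
    clock_t lo = bm[0].start;
    clock_t hi = bm[0].start;
    for (std::size_t i = 1; i < n; ++i) {
        lo = std::min(lo, bm[i].start);
        hi = std::max(hi, bm[i].start);
    }
    return static_cast<long>(hi - lo);
}

// Hypothetical helper: latest end tick minus earliest end tick.
long end_spread(const bench* bm, std::size_t n) {
    assert(n > 0);  // at least one record is required
    clock_t lo = bm[0].end;
    clock_t hi = bm[0].end;
    for (std::size_t i = 1; i < n; ++i) {
        lo = std::min(lo, bm[i].end);
        hi = std::max(hi, bm[i].end);
    }
    return static_cast<long>(hi - lo);
}
```

Applied to the 8-thread run above, the start ticks range from 2816082 to 2816142 (a spread of 60) and the end ticks from 3616616 to 3616680 (a spread of 64), so the threads differ by a few dozen ticks rather than being identical.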
I am quite confused, because I believed that SIMD hardware means all the threads execute the same instruction at the same time, so I would expect the timings to be identical.