Yet another clock() question

I have a kernel that looks something like this:

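__global__ void test_timer(float *C, bench *BM)
{
    // Read the multiprocessor's cycle counter before doing any work.
    clock_t beginning = clock();

    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int a = 0;
    int i;

    // Busy loop whose duration is being measured.
    for (i = 0; i < 10000; i++)
    {
        a = a + i;
    }

    // Read the counter again once the loop is done.
    clock_t ending = clock();

    // Store the result so the loop has an observable effect.
    C[tid] = a;

    // Record both readings for this thread (bench has start and end fields).
    BM[tid].start = beginning;
    BM[tid].end = ending;
}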

Now, I would expect all threads to start and end at the same time, since the MP is SIMD hardware. However, this is not the case.

Running with 1 thread:

tid | start time | end time | duration
0   | 2793598    | 3594136  | 800538

Running with 2 threads:

tid | start time | end time | duration
0   | 2893796    | 3694320  | 800524
1   | 2893788    | 3694328  | 800540

Running with 4 threads:

tid | start time | end time | duration
0   | 2807946    | 3608452  | 800506
1   | 2807938    | 3608460  | 800522
2   | 2807932    | 3608472  | 800540
3   | 2807924    | 3608480  | 800556

Running with 8 threads:

tid | start time | end time | duration
0   | 2816142    | 3616628  | 800486
1   | 2816134    | 3616636  | 800502
2   | 2816090    | 3616672  | 800582
3   | 2816082    | 3616680  | 800598
4   | 2816110    | 3616616  | 800506
5   | 2816102    | 3616624  | 800522
6   | 2816134    | 3616660  | 800526
7   | 2816126    | 3616668  | 800542
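To put a number on how far apart the threads are, here is a small host-side sketch in plain C++ (it is not part of the kernel). It computes the spread, meaning the largest value minus the smallest, of the start times, end times and durations across the threads of the 8-thread run. The names `Sample`, `kRun8`, `spread` and `report` are made up for this illustration, and `Sample` is only assumed to mirror the `start`/`end` fields of `bench`; the values are copied from the 8-thread table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <vector>

// One row of the timing table: the two raw clock() readings of a thread.
// Hypothetical host-side mirror of the start/end fields of bench.
struct Sample {
    long long start;
    long long end;
};

// The 8-thread run, copied from the table (index == tid).
static const std::vector<Sample> kRun8 = {
    {2816142, 3616628}, {2816134, 3616636},
    {2816090, 3616672}, {2816082, 3616680},
    {2816110, 3616616}, {2816102, 3616624},
    {2816134, 3616660}, {2816126, 3616668},
};

// Largest minus smallest value of field(sample) over all threads, in ticks.
template <typename Field>
long long spread(const std::vector<Sample>& run, Field field) {
    assert(!run.empty());
    auto mm = std::minmax_element(
        run.begin(), run.end(),
        [&](const Sample& a, const Sample& b) { return field(a) < field(b); });
    return field(*mm.second) - field(*mm.first);
}

long long start_spread(const std::vector<Sample>& run) {
    return spread(run, [](const Sample& s) { return s.start; });
}

long long end_spread(const std::vector<Sample>& run) {
    return spread(run, [](const Sample& s) { return s.end; });
}

long long duration_spread(const std::vector<Sample>& run) {
    return spread(run, [](const Sample& s) { return s.end - s.start; });
}

// Print the three spreads for one run.
void report(const std::vector<Sample>& run) {
    std::printf("start spread:    %lld ticks\n", start_spread(run));
    std::printf("end spread:      %lld ticks\n", end_spread(run));
    std::printf("duration spread: %lld ticks\n", duration_spread(run));
}
```

Calling `report(kRun8)` from a `main` gives spreads of 60, 64 and 112 ticks for start, end and duration. That is small next to the roughly 800,500-tick loop, but it is clearly not zero.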

I am quite confused, as I believed that having SIMD hardware means all the threads execute the same instruction at the same time, so I would expect the timings to be identical.