All printf statements are executed in sequence in a CUDA kernel of dimension (1,3)

I have the following CUDA kernel, in which I am just printing the index of each thread. My actual goal is different, but since I was not getting the expected results, I am doing some simple debugging with printf statements.

__global__ void kernel()
{
	const int index = blockIdx.x * blockDim.x + threadIdx.x;

	printf("\n Index: %d\n", index);

	for (int i = 32 * index; i < (index + 32); i++){
		printf("%d ", index);

	}

	printf("\n One Thread Index finished\n");
}

The output of the above kernel is

Index: 0

 Index: 1

 Index: 2
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 One Thread Index finished

 One Thread Index finished

 One Thread Index finished

I cannot understand the order of the output above. Why does it not look like the following?

Index: 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 One Thread Index finished

and so on for indices 1 and 2.

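I found the answer. There are two separate issues. First, the bounds of the for loop are wrong: the loop runs from 32 * index up to index + 32, so thread 0 prints 32 values, thread 1 prints only one, and thread 2 prints none. That is why the thread indices are not all printed on the line of numbers. Second, the printf output is grouped by statement rather than by thread because the threads run in parallel: all three threads execute the same statement at the same time, so every thread prints its "Index" line before any of them reaches the loop, and every thread prints its "finished" line only after the loop is done.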
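Here is a minimal sketch of both fixes. It assumes the kernel is launched as <<<1, 3>>> (the launch configuration is not shown in the question) and that each thread was meant to print 32 values. The names rangeBegin, rangeEnd and kernelOrdered are made up for illustration. The loop bound becomes 32 * (index + 1), and the threads of the block take turns printing, separated by __syncthreads(), so that the output of each thread should stay together. This only serializes threads within a single block, and it relies on device printf output appearing in the order in which the calls were made, which is what one normally observes but is treated as an assumption here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of values each thread prints (assumed to be 32, as in the original loop).
constexpr int kPerThread = 32;

// Hypothetical helpers: first and one-past-last loop counter for a thread index.
__host__ __device__ inline int rangeBegin(int index) { return kPerThread * index; }
__host__ __device__ inline int rangeEnd(int index)   { return kPerThread * (index + 1); }

__global__ void kernelOrdered()
{
	const int index = blockIdx.x * blockDim.x + threadIdx.x;

	// Let one thread of the block print at a time.
	for (unsigned int turn = 0; turn < blockDim.x; ++turn) {
		if (threadIdx.x == turn) {
			printf("\n Index: %d\n", index);

			// Corrected bounds: exactly kPerThread iterations per thread.
			for (int i = rangeBegin(index); i < rangeEnd(index); ++i) {
				printf("%d ", index);
			}

			printf("\n One Thread Index finished\n");
		}
		// Every thread of the block waits here before the next turn starts.
		__syncthreads();
	}
}

int main()
{
	// Assumed launch configuration: one block of three threads.
	kernelOrdered<<<1, 3>>>();

	// Device printf output is flushed when the host synchronizes.
	cudaError_t err = cudaDeviceSynchronize();
	if (err != cudaSuccess) {
		fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
		return 1;
	}
	return 0;
}
```

Taking turns like this is only meant for debugging, since it removes the parallelism. If the grouping of the output does not matter, fixing the loop bounds alone is enough.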