Record the execution order of the threads?

Hey all,
Is there a way to record the execution sequence of the threads in a block? For example, in xxx_kernel.cu:
int tid = threadIdx.y*blockDim.x+threadIdx.x;
__syncthreads();

Since the threads execute independently, they arrive at the __syncthreads() at different times. How can I record the order in which they arrive? I have tried, but got no result. Maybe it is impossible to do that?
Any reply is appreciated!

Well, if this is from an example, I guess first all threads with threadIdx.y == 0 are run, then with threadIdx.y==1, etc.

I am actually also looking for this answer as I need to do a reduction at the end of a kernel that has a 2D blocksize. I was thinking of running this code:

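int output[16];
int *d_output;
cudaMalloc((void **)&d_output, 16 * sizeof(int));
test_kernel<<<1, dim3(16, 16, 1)>>>(d_output);
cudaMemcpy(output, d_output, 16 * sizeof(int), cudaMemcpyDeviceToHost);
for (int k = 0; k < 16; k++)
  printf("%d ", output[k]);

__global__ void test_kernel(int *out)
{
  // All 16 threads sharing a threadIdx.x write the same slot; the last writer wins.
  out[threadIdx.x] = threadIdx.y;
}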

If the threads with threadIdx.y == 0 run first, in the first half-warp, then threadIdx.y == 1 in the second half-warp, and so on, the printed values should all be 15.

Thanks for your reply.

I have tried the code you suggested, but the printed values are always 11 on my computer. I don't know why. It does not seem to be stochastic, but I can't find the rule.

Hmm, I will try to figure it out tomorrow when I am back at work.
The fact that all values are the same is an indication that int tid = threadIdx.y*blockDim.x+threadIdx.x; is the correct formula.

In that case tid 0 to 31 form one warp, tid 32 to 63 form another warp, and so on.
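To make the tid-to-warp mapping above visible, here is a minimal sketch that has every thread of a 16x16 block store the warp it belongs to. The names linear_tid, warp_of and warp_map_kernel are made up for illustration and are not from this thread; the sketch assumes that warps are formed from consecutive linear thread indices, as described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Linear thread index within a 2D block (the formula discussed above).
__host__ __device__ inline int linear_tid(int x, int y, int dim_x)
{
    return y * dim_x + x;
}

// Warp index of a linear thread index.
__host__ __device__ inline int warp_of(int tid, int warp_size)
{
    return tid / warp_size;
}

// Hypothetical kernel: each thread stores the index of its own warp.
__global__ void warp_map_kernel(int *warp_id)
{
    int tid = linear_tid(threadIdx.x, threadIdx.y, blockDim.x);
    warp_id[tid] = warp_of(tid, warpSize);
}

int main()
{
    const int N = 16 * 16;
    int h_warp[N];
    int *d_warp = nullptr;
    if (cudaMalloc((void **)&d_warp, N * sizeof(int)) != cudaSuccess) return 1;

    warp_map_kernel<<<1, dim3(16, 16, 1)>>>(d_warp);
    if (cudaMemcpy(h_warp, d_warp, N * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    // One printed row per threadIdx.y, one column per threadIdx.x.
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            printf("%d ", h_warp[linear_tid(x, y, 16)]);
        printf("\n");
    }
    cudaFree(d_warp);
    return 0;
}
```

On hardware with a warp size of 32, each warp should cover two consecutive threadIdx.y rows of this block. This only shows which threads share a warp; it says nothing about the order in which the warps run.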

If you find out the rules or a way to solve the problem, please tell me.

Appreciated!

You should never depend on the order of execution of independent threads.

With that disclaimer, you could study how stochastic the thread ordering is by using the clock instruction (clock() in device code) to record the clock time at which each warp executes. Comparing these clock times within a block would give you an ordering.

Note: If you do this with a “toy” kernel that doesn’t read any global memory or do any other real work, I would guess you will find the results less stochastic than with a kernel that interleaves a significant amount of memory access and computation. Such interleaving causes more variance in the order in which warps execute.
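The clock-based study suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: the kernel name record_arrival and the host helper order_by_stamp are hypothetical, the 64-bit clock64() is used instead of clock() to avoid counter wrap-around, and a warp size of 32 is assumed on the host side. Only one block is launched, so all timestamps come from the same multiprocessor and can be compared with each other.

```cuda
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the first lane of every warp stores the cycle
// counter just before the barrier.
__global__ void record_arrival(long long *stamp)
{
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid % warpSize == 0)
        stamp[tid / warpSize] = clock64();
    __syncthreads();
}

// Host helper: warp indices sorted by ascending timestamp.
std::vector<int> order_by_stamp(const std::vector<long long> &stamp)
{
    std::vector<int> order(stamp.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return stamp[a] < stamp[b]; });
    return order;
}

int main()
{
    const int num_warps = (16 * 16) / 32;  // assumes a warp size of 32
    std::vector<long long> h_stamp(num_warps);
    long long *d_stamp = nullptr;
    if (cudaMalloc((void **)&d_stamp,
                   num_warps * sizeof(long long)) != cudaSuccess) return 1;

    record_arrival<<<1, dim3(16, 16, 1)>>>(d_stamp);
    if (cudaMemcpy(h_stamp.data(), d_stamp, num_warps * sizeof(long long),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    // Print the warps in the order in which they took their timestamp.
    for (int w : order_by_stamp(h_stamp))
        printf("warp %d at cycle %lld\n", w, h_stamp[w]);

    cudaFree(d_stamp);
    return 0;
}
```

Running this several times, and again after adding real memory traffic before the timestamp, would show how stable the ordering is. The result is a measurement for one run, not something a kernel should rely on.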

But there is a fixed ordering in how threadIdx.x values are distributed over the warps (the reduction sample relies on threadIdx.x 0 to 31 being in one warp).

Have you seen if there is also such an ordering for threadIdx.x & threadIdx.y?

Now that I read it again, this is admittedly off-topic for the original question, but I thought it was what he was actually looking for. Indeed, the order in which a __syncthreads() is reached is very, very undetermined (take, for example, a kernel in which a large piece of code sits inside an if(threadIdx.x==0) { }).

You are right. I only tested it with a simple kernel to find the absolute order of the threads within a block; of course, the order can vary a lot from one test to another. I will try again using the clock instruction. Thanks for your guidance.