The order of CTA execution

Hello. I have a question about the order of CTA execution.
The code below is an example pseudo-code for explaining my curious point.

__global__ void gpuKernel(args){
  printf("The execution order of CTA in CUDA programming model is %d\n", BlockIdx.x + GridDim.x * BlockIdx.y);
dim3 GridDimension(4, 4);
dim3 TBDimension(1, 1);
gpuKernel<<<GridDimension, TBDimension>>>(args);

In my experiment, the order seemed to be random but there were some kind of tendency
like this

15 << always first
10 << always last

So, why does the order not print in order?

I know that the stdout buffer exists but I don’t think it affects this.

Even if GPU uses GTO scheduling methodology, the initial state should be in order. isn’t it?


The order of thread block (CTA) distribution is not documented and is different per architecture. Part of what you are seeing is likely work distribution order and the other part is race condition on allocating room in the printf buffer.

Hi @Greg , Thanks for your reply.

If I understand what you’re saying, it means that the actual order of output and the order of CTA distributed by the GPU’s Hardware Scheduler may be different because race condition may occur for the print buffer.

Then, I have another question.
Is there any method that can monitor or record the order of distributed CTA?
If I think about it in relation to your answer, I think we need another way to check because the stdout output through the print buffer does not represent the exact order.

Apart from this, I thought CTAs would be scheduled in order from number 0 on GPUs in their initial state, which seems to be quite an interesting result.

I want to reiterate that the order is undefined. There is no reason to expect it to be the same from launch to launch, or predictable, and the reason for this unpredictability is unspecified by NVIDIA. Any attempt to rely on a presumed order, however arrived at, would make your code broken by definition.

There is and should be no expectation of any order, nor any repeatability of an order.

You could have each threadblock do an atomic to reserve space in a buffer. Then fill that space with the SM ID, block index, and/or whatever data you wish.

Given that atomic ordering is unspecified/undefined/unpredictable, this method also has uncertainty. Exact ordering without any unspecified character would require HW visibility that AFAIK NVIDIA does not provide.

__device__ unsigned sequence = 0;
__device__ unsigned get_smid(void)
  unsigned ret;
  asm(“mov.u32 %0, %smid;” : “=r”(ret) );
  return ret;
__global__ void k(unsigned *buffer){
  unsigned my_slot = atomicAdd(&sequence, 1);
  unsigned my_sm = get_smid();
  buffer[my_slot]= ((unsigned)blockIdx.x) & (my_sm << 16);

In addition to Robert’s suggestion of logging you can also write out the global timer. The global timer is available via inline PTX %globaltimer or std::chrono::system_clock.

__device__ __forceinline__ uint64_t __globaltimer()
    uint64_t globaltime;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(globaltime));
    return globaltime;


__device__ __forceinline__ uint32_t __globaltimer_lo()
    uint64_t globaltime_lo;
    asm volatile("mov.u32 %0, %%globaltimer_lo;" : "=r"(globaltime_lo));
    return globaltime_lo;

globaltimer is in nanoseconds since 01/01/1970 on most systems. This is not a documented requirement so this value may not be consistent on all platforms.

  • 32 nanosecond on CC >= 9.0
  • 1 microsecond CC < 9.0. Running test with NSYS or NCU will reduce to 32 ns.

Thanks for both replies. @Robert_Crovella, @Greg

Unfortunately, the exact order of distribution of the hardware scheduler cannot be documented or accurately monitored, but I think we should close this issue, as we can use the global timing information available at the device level to verify it to some extent. Using your methods, I think we can make a microbench to determine the hardware characteristics in more detail.

Thank you for your help.