Strange results when profiling inst_executed for OpenCL kernel

Hi, I wrote a simple kernel and am trying to profile it in OpenCL using GPU performance counters, but the results are a little strange.

If I use the kernel below:

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__kernel void test(__global float* x, __global float* y)
{
    int id = get_global_id(0);
    float t = exp2(x[id]);
    y[id] = y[id] + t;
}

and launch it with an NDRange of 1024 global work-items and a work-group size of 256, then the profiler output is very stable (let's consider the instruction count only).
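
For reference, the launch geometry above works out as follows. This is only a rough sketch in plain C, assuming a warp size of 32 (not stated in this post and hardware-dependent) and assuming the counter counts per warp rather than per work-item. The helper names are hypothetical and are not part of any OpenCL API.

```c
#include <stddef.h>

/* ASSUMPTION: warp width of 32; this is hardware-dependent and not stated in the post. */
enum { ASSUMED_WARP_SIZE = 32 };

/* Number of work-groups in a 1D NDRange.
   Assumes global_size is an exact multiple of local_size. */
static size_t work_group_count(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}

/* Number of warps needed to cover one work-group (rounded up). */
static size_t warps_per_group(size_t local_size)
{
    return (local_size + ASSUMED_WARP_SIZE - 1) / ASSUMED_WARP_SIZE;
}

/* Total warps launched for the whole NDRange. */
static size_t total_warps(size_t global_size, size_t local_size)
{
    return work_group_count(global_size, local_size) * warps_per_group(local_size);
}
```

With 1024 global work-items and a work-group size of 256, this gives 4 work-groups of 8 warps each, so 32 warps in total under the assumptions above.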

But if I add a simple for loop into the kernel like this:

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__kernel void test(__global float* x, __global float* y)
{
    int id = get_global_id(0);
    for (int i = 0; i < 100; i++)
    {
        float t = exp2(x[id]);
        y[id] = y[id] + t;
    }
}

Then the results are unstable, ranging from 6000 to 11000 instructions across different runs.

Could anyone explain to me why there is such uncertainty in the profiling results?

Thanks,