Hi, I wrote a simple kernel and am trying to profile it in OpenCL using GPU performance counters, but the results are a little strange.
If I use the kernel below:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
__kernel void test(__global float* x, __global float* y)
{
    int id = get_global_id(0);
    float t = exp2(x[id]);
    y[id] = y[id] + t;
}
and launch it over an NDRange of 1024 global work-items with a work-group size of 256, then the output of the profiler is very stable (let’s consider only the instruction count).
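To make the launch configuration concrete, here is a minimal sketch in plain C (no OpenCL runtime needed) of how those sizes divide into work-groups. The names `GLOBAL_SIZE`, `LOCAL_SIZE`, and `num_work_groups` are made up for illustration and are not from my host code. The sketch also assumes the global size must be an exact multiple of the work-group size and treats anything else as invalid.

```c
#include <stddef.h>

/* Sizes taken from the post; the identifiers themselves are illustrative. */
enum { GLOBAL_SIZE = 1024, LOCAL_SIZE = 256 };

/* Number of work-groups in a 1-D NDRange.
   Assumption: the global size must be a multiple of the local size,
   so 0 is returned for a zero or non-dividing local size. */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}
```

With the sizes above, `num_work_groups(GLOBAL_SIZE, LOCAL_SIZE)` gives 4 work-groups of 256 work-items each.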
But if I add a simple for loop into the kernel like this:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
__kernel void test(__global float* x, __global float* y)
{
    int id = get_global_id(0);
    for (int i = 0; i < 100; i++)
    {
        float t = exp2(x[id]);
        y[id] = y[id] + t;
    }
}
Then the results are unstable, ranging from 6000 to 11000 instructions across different runs.
Could anyone explain why there is such uncertainty in the profiling results?
Best,
Thanks