Seems the thread/block sizes made a huge difference to some of my kernels (but not all), and now I have hit another strange issue.

One of my kernels seems to vary in both FLOP count and execution time (104us to 87us). I had assumed the time difference was something related to card warmup (changing clockspeed state or something) as it starts off at the 105us, and switches to 85us after the first thirty or so calls. The confusing thing is that no other kernels seem to have this speed shift.

Looking at what nSight says, the kernel seems to vary in FLOP/DFLOP count (which in itself is strange unless _sincosf is not a static FLOP count), in issued IPC (and executed). The kernel when executed with the lower FLOP count takes the 100us, yet when it runs with an extra 150k FLOPs it runs at 85us.

```
template<int blockCount, int fastSinCos>
__global__ void SubPixelShiftKernel512(float shiftX, float shiftY, cuComplex const* d_src, cuComplex* d_dest)
{
const int tid = IMUL(512, blockIdx.x) + threadIdx.x;
const int threadN = IMUL(blockDim.x, gridDim.x);
float twoPiShiftY = TWOPI * shiftY;
//float subj = float(blockIdx.x) / 512.0f; //This will always be below the 0.5f threshhold, unless we have 256 blocks
float subj = __fdividef(float(blockIdx.x), 512.0f);
#if blockCount >= 256
#error Too high a block count provided
#endif
float exp_0 = float(tid & 0x00ff) / 512.0f;
float exp_1 = exp_0 - 0.5f;
float exp_0b;
float exp_1b;
exp_0 *= shiftX;
exp_1 *= shiftX;
exp_0b = TWOPI * exp_0;
exp_1b = TWOPI * exp_1;
exp_0 = (subj * twoPiShiftY) + exp_0b;
exp_1 = (subj * twoPiShiftY) + exp_1b;
for (int i = tid; i < (512*512); i+= (blockCount * 512))
{
cuComplex src = d_src[i];
src.x /= (512.0f*512.0f);
src.y /= (512.0f*512.0f);
cuComplex shift_exp;
if (fastSinCos == 1)
{
__sincosf(exp_0,&shift_exp.x, &shift_exp.y);
}
else
{
shift_exp.x = cosf(exp_0);
shift_exp.y = sinf(exp_0);
}
shift_exp.y = 0 - shift_exp.y;
float ac = shift_exp.x * src.x;
float bd = shift_exp.y * src.y;
float abcd = (shift_exp.x + shift_exp.y) * (src.x + src.y);
shift_exp.x = ac - bd;
shift_exp.y = abcd - ac - bd;
d_dest[i] = shift_exp;
src = d_src[i+256];
src.x /= (512.0f*512.0f);
src.y /= (512.0f*512.0f);
if (fastSinCos == 1)
{
__sincosf(exp_1,&shift_exp.x, &shift_exp.y);
}
else
{
shift_exp.x = cosf(exp_1);
shift_exp.y = sinf(exp_1);
}
shift_exp.y = 0 - shift_exp.y;
ac = shift_exp.x * src.x;
bd = shift_exp.y * src.y;
abcd = (shift_exp.x + shift_exp.y) * (src.x + src.y);
shift_exp.x = ac - bd;
shift_exp.y = abcd - ac - bd;
d_dest[i+256] = shift_exp;
subj += (float(blockCount) / 512.0f);
if (subj >= 0.5f)
{
subj -= 1.0f;
}
exp_0 = (subj * twoPiShiftY) + exp_0b;
exp_1 = (subj * twoPiShiftY) + exp_1b;
}
};
```

I even tried removing the sin/cos to remove the possibility that it is a non-deterministic calculation yet that did not solve the strange changes.

Anyone have any ideas?