I have a kernel that does “image warping”:

```
__global__ void warp(unsigned char* dst, int isx, int isy, int osx, int osy) {
    // Output pixel coordinates for this thread
    const int ox = IMAD(blockDim.x, blockIdx.x, threadIdx.x);
    const int oy = IMAD(blockDim.y, blockIdx.y, threadIdx.y);
    if (ox >= osx || oy >= osy) {
        return;
    }
    // Source coordinates from the (expensive) warp functions
    double x = funcx(ox, oy);
    double y = funcy(ox, oy);
    if (x >= 0.0 && y >= 0.0 && x < isx && y < isy) {
        dst[IMAD(oy, osx, ox)] = tex2D(texSrc, x, y);
    }
}
```

funcx() and funcy() are rather lengthy calculations involving trigonometric functions (both single and double precision).

Input images are 5616x3744 pixels and output images are 2048x2048. On average, about 100,000 pixels are read/written per kernel launch. It runs on a GTX 480 card, and I use 16x12 threads per block.

The kernel takes about 29 ms to run. I was expecting more speed. Which profiling counters should I look at to find the bottleneck? Any ideas for speedups? Thanks!