Slow image warping

I have a kernel that does “image warping”:

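__global__ void warp(unsigned char* dst, int isx, int isy, int osx, int osy) {
  const int ox = IMAD(blockDim.x, blockIdx.x, threadIdx.x);
  const int oy = IMAD(blockDim.x, blockIdx.x, threadIdx.x);

  // Threads outside the output image have nothing to do.
  if (ox >= osx || oy >= osy) {
    return;
  }

  // Source coordinates for this output pixel.
  double x = funcx(ox, oy);
  double y = funcy(ox, oy);

  // Only fetch when the source coordinates fall inside the input image.
  if (x >= 0.0 && y >= 0.0 && x < isx && y < isy) {
    dst[IMAD(oy, osx, ox)] = tex2D(texSrc, x, y);
  }
}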

funcx() and funcy() are rather lengthy calculations involving trigonometric functions (single and double precision).

Input images are 5616x3744 pixels and output images are 2048x2048. On average about 100,000 pixels are read/written per kernel launch. It runs on a GTX480 card. I use 16x12 threads per block.

It takes about 29 ms to run the kernel. I was expecting more speed. Which profiling counters should I look at to see where the bottleneck is? Any ideas for speedups? Thanks!

It seems you are still compute bound, so funcx() and funcy() are the places to optimize. Is it really necessary to use double precision? Are you using the lower-precision intrinsic trigonometric functions? Try the --use_fast_math command-line switch. You could also evaluate funcx() and funcy() at fewer pixel positions and interpolate in between.
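A rough sketch of the last suggestion (evaluate at fewer positions, interpolate in between), under several assumptions: the funcx()/funcy() bodies below are hypothetical stand-ins because the real ones are not shown, the names coarseDim, evalCoarse, interpMap and buildMap are made up for illustration, and the mapping is assumed to be smooth enough that bilinear interpolation between nodes is accurate. The interpolated coordinates are written to an array only to keep the sketch self-contained; in the real kernel they would go straight into the texture fetch.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins: the thread never shows the real funcx()/funcy().
__device__ float funcx(float ox, float oy) { return 2.7f * ox + 8.0f * sinf(0.01f * oy); }
__device__ float funcy(float ox, float oy) { return 1.8f * oy + 8.0f * cosf(0.01f * ox); }

// Number of coarse nodes along one axis so that every output pixel
// has a node on both sides of it.
__host__ __device__ inline int coarseDim(int outSize, int step) {
  return (outSize - 1) / step + 2;
}

// Pass 1: run the expensive mapping only at the coarse nodes (every `step` pixels).
__global__ void evalCoarse(float2* nodes, int gw, int gh, int step) {
  const int gx = blockIdx.x * blockDim.x + threadIdx.x;
  const int gy = blockIdx.y * blockDim.y + threadIdx.y;
  if (gx >= gw || gy >= gh) return;
  const float ox = (float)(gx * step);
  const float oy = (float)(gy * step);
  nodes[gy * gw + gx] = make_float2(funcx(ox, oy), funcy(ox, oy));
}

// Pass 2: bilinearly interpolate the four surrounding nodes for each output pixel.
__global__ void interpMap(const float2* nodes, int gw, int step,
                          float2* srcXY, int osx, int osy) {
  const int ox = blockIdx.x * blockDim.x + threadIdx.x;
  const int oy = blockIdx.y * blockDim.y + threadIdx.y;
  if (ox >= osx || oy >= osy) return;
  const int gx = ox / step, gy = oy / step;
  const float fx = (float)(ox - gx * step) / (float)step;
  const float fy = (float)(oy - gy * step) / (float)step;
  const float2 a = nodes[gy * gw + gx];
  const float2 b = nodes[gy * gw + gx + 1];
  const float2 c = nodes[(gy + 1) * gw + gx];
  const float2 d = nodes[(gy + 1) * gw + gx + 1];
  const float topX = a.x + fx * (b.x - a.x), botX = c.x + fx * (d.x - c.x);
  const float topY = a.y + fx * (b.y - a.y), botY = c.y + fx * (d.y - c.y);
  srcXY[oy * osx + ox] = make_float2(topX + fy * (botX - topX),
                                     topY + fy * (botY - topY));
}

// Host side: dNodes must hold coarseDim(osx, step) * coarseDim(osy, step) elements,
// dSrcXY must hold osx * osy elements.
void buildMap(float2* dNodes, float2* dSrcXY, int osx, int osy, int step) {
  const int gw = coarseDim(osx, step), gh = coarseDim(osy, step);
  const dim3 block(16, 12);
  const dim3 gridC((gw + block.x - 1) / block.x, (gh + block.y - 1) / block.y);
  const dim3 gridF((osx + block.x - 1) / block.x, (osy + block.y - 1) / block.y);
  evalCoarse<<<gridC, block>>>(dNodes, gw, gh, step);
  interpMap<<<gridF, block>>>(dNodes, gw, step, dSrcXY, osx, osy);
}
```

With a 2048x2048 output and step = 16 this evaluates the mapping at 129x129 = 16,641 nodes instead of 4,194,304 pixels, roughly 250 times fewer calls. Whether the interpolation error is acceptable depends entirely on how quickly the real funcx()/funcy() vary.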

Removing the computations brings the runtime down to about 3 ms.

How do I calculate the memory throughput when reading from texture memory and storing to global memory? tex0_cache_sector_misses is 0, so I ruled that one out.

So you really are compute bound. Try my suggestions above.

I’ve just used the worst case where every texture interpolation involves reading the four surrounding pixels. If even that can’t explain the time, the kernel must be compute bound.
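For reference, the worst-case arithmetic can be spelled out as a small host-side helper. This is only a sketch: worstCaseGBps and printEstimate are hypothetical names, and 1-byte source texels are an assumption (the thread only shows that dst is unsigned char).

```cuda
#include <cstdio>

// Worst-case traffic estimate: every written pixel costs `readsPerPixel`
// texel fetches plus one store. Returns GB/s, with 1 GB = 1e9 bytes.
inline double worstCaseGBps(double pixels, int readsPerPixel, int bytesPerTexel,
                            int bytesPerStore, double ms) {
  const double bytes = pixels * (readsPerPixel * bytesPerTexel + bytesPerStore);
  return bytes / (ms * 1e-3) / 1e9;
}

// Prints the estimate for the numbers quoted in the thread.
inline void printEstimate() {
  // 100,000 pixels, 4 texels per bilinear fetch,
  // 1-byte texels (assumed), 1-byte stores.
  std::printf("%.4f GB/s at 29 ms\n", worstCaseGBps(100000, 4, 1, 1, 29.0));
  std::printf("%.4f GB/s at 3 ms\n", worstCaseGBps(100000, 4, 1, 1, 3.0));
}
```

That is 500,000 bytes per launch, or about 0.017 GB/s at 29 ms and about 0.167 GB/s at 3 ms. Both figures are orders of magnitude below the roughly 177 GB/s peak memory bandwidth NVIDIA quotes for the GTX 480, which is why memory traffic cannot account for the runtime.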

Timing is tricky when you remove the calculations.
The compiler might be optimizing away most of the kernel.
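One common way around this, sketched below, is to keep a store that depends on the computed value but is guarded by a runtime condition that is practically never true, so the compiler cannot discard the computation as dead code. The funcx() body, the kernel name computeOnly and the helper timeComputeOnly are hypothetical; the sentinel -1.0e30f is assumed to be a value the mapping never produces.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's expensive mapping.
__device__ float funcx(float ox, float oy) { return ox + 8.0f * sinf(0.01f * oy); }

// Compute-only variant for timing. `never` is a kernel argument, so the
// compiler cannot prove the comparison false and must keep the funcx() call.
__global__ void computeOnly(float* sink, int osx, int osy, float never) {
  const int ox = blockIdx.x * blockDim.x + threadIdx.x;
  const int oy = blockIdx.y * blockDim.y + threadIdx.y;
  if (ox >= osx || oy >= osy) return;
  const float x = funcx((float)ox, (float)oy);
  if (x == never) sink[0] = x;  // practically never executed
}

// Times one launch with CUDA events and returns the elapsed milliseconds.
// dSink must point to at least one float of device memory.
float timeComputeOnly(float* dSink, int osx, int osy) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const dim3 block(16, 12);
  const dim3 grid((osx + block.x - 1) / block.x, (osy + block.y - 1) / block.y);
  cudaEventRecord(start);
  computeOnly<<<grid, block>>>(dSink, osx, osy, -1.0e30f);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}
```

Comparing this against a fetch-and-store variant with trivial coordinates gives a fairer split between compute time and memory time than simply deleting code. Dumping the generated machine code with cuobjdump is another way to check what the compiler actually kept.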

And is that really your code? ox=oy?
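As posted, both indices use the x components, so every thread would land on a diagonal pixel (ox == oy). Presumably the intended computation looks like the sketch below. IMAD is not defined in the post, so a plain multiply-add is assumed here; pixelCoords and gridFor are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

// Assumed definition: the post uses IMAD without showing it.
#define IMAD(a, b, c) ((a) * (b) + (c))

// Writes each thread's output pixel coordinates so the indexing can be checked.
__global__ void pixelCoords(int2* coords, int osx, int osy) {
  const int ox = IMAD(blockDim.x, blockIdx.x, threadIdx.x);
  const int oy = IMAD(blockDim.y, blockIdx.y, threadIdx.y);  // y components, not x
  if (ox >= osx || oy >= osy) return;
  coords[IMAD(oy, osx, ox)] = make_int2(ox, oy);
}

// Grid size that covers an osx x osy output image with the given block size.
inline dim3 gridFor(int osx, int osy, dim3 block) {
  return dim3((osx + block.x - 1) / block.x, (osy + block.y - 1) / block.y);
}
```

For the 2048x2048 output and 16x12 blocks mentioned above, gridFor gives a 128x171 grid; the last row of blocks overhangs the image by four rows, which the bounds check takes care of.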

I agree using doubles is probably overkill, especially since the texture interpolation hardware only works at float precision. GTX480 has relatively slow double precision performance.
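A minimal sketch of what staying in single precision means in practice, assuming a made-up mapping (mapXd and mapXf are hypothetical, not the poster's funcx()). The easy thing to miss is that a single unsuffixed literal such as 0.01 silently promotes the whole expression to double.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Double-precision version: double literals and sin() on doubles.
__host__ __device__ inline double mapXd(double ox, double oy) {
  return ox + 8.0 * sin(0.01 * oy);
}

// Single-precision version: every literal carries the f suffix and sinf()
// is used, so nothing is promoted to double. In device code, __sinf() is
// the faster, less accurate intrinsic.
__host__ __device__ inline float mapXf(float ox, float oy) {
  return ox + 8.0f * sinf(0.01f * oy);
}

// Writes the single-precision source x coordinate for each output pixel.
// The texture fetch itself would stay as in the original kernel.
__global__ void mapKernel(float* srcX, int osx, int osy) {
  const int ox = blockIdx.x * blockDim.x + threadIdx.x;
  const int oy = blockIdx.y * blockDim.y + threadIdx.y;
  if (ox >= osx || oy >= osy) return;
  srcX[oy * osx + ox] = mapXf((float)ox, (float)oy);
}
```

Note that --use_fast_math only affects single-precision functions, so any double arithmetic left in funcx()/funcy() runs at full cost regardless of that switch.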