Slow image warping

Pibben · January 19, 2011, 9:36am

I have a kernel that does “image warping”:

__global__ void warp(unsigned char* dst, int isx, int isy, int osx, int osy) {
  const int ox = IMAD(blockDim.x, blockIdx.x, threadIdx.x);
  const int oy = IMAD(blockDim.x, blockIdx.x, threadIdx.x);

  if(ox >= osx || oy >= osy) {
    return;
  }

  double x = funcx(ox,oy);
  double y = funcy(ox,oy);

  if(x >= 0.0 && y >= 0.0 && x < isx && y < isy) {
    dst[IMAD(oy, osx, ox)] = tex2D(texSrc, x, y);
  }
}

funcx() and funcy() are rather lengthy calculations involving trigonometric functions (single and double precision).

Input images are 5616x3744 pixels and output images are 2048x2048. On average about 100,000 pixels are read/written per kernel launch. It runs on a GTX480 card. I use 16x12 threads per block.

It takes about 29 ms to run the kernel. I was expecting more speed. Which profiling counters should I look at to see where the bottleneck is? Any ideas for speedups? Thanks!

tera · January 19, 2011, 10:27am

Seems you are still compute bound, so funcx() and funcy() are the places to optimize. Is it really necessary to use double precision? Are you using the lower precision intrinsic trigonometric functions? Try the --use_fast_math command line switch. You could also evaluate funcx() and funcy() at fewer pixel positions and interpolate in between.
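The last suggestion (evaluate the mapping at fewer positions and interpolate) could look roughly like the sketch below. It is only an illustration under assumptions: mapCoord() is a hypothetical stand-in for funcx()/funcy(), which the thread never shows, and coarseDim(), fillCoarseMap(), lerpCoarse() and warpCoarse() are made-up names. The source image is assumed to be bound to a texture object with unsigned char texels read as element type, which may differ from the original texSrc setup.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical stand-in for funcx()/funcy(); the real mapping is not shown in the thread.
__host__ __device__ inline float2 mapCoord(float ox, float oy) {
    return make_float2(ox + 10.0f * sinf(0.01f * oy),
                       oy + 10.0f * cosf(0.01f * ox));
}

// Number of coarse nodes needed to cover n output pixels with spacing `step`,
// including one extra node past the edge so every pixel has four neighbours.
__host__ __device__ inline int coarseDim(int n, int step) {
    return (n - 1) / step + 2;
}

// Pass 1: evaluate the expensive mapping only at every `step`-th output pixel.
// Node (gx, gy) sits at output pixel (gx * step, gy * step).
__global__ void fillCoarseMap(float2* coarse, int gw, int gh, int step) {
    const int gx = blockIdx.x * blockDim.x + threadIdx.x;
    const int gy = blockIdx.y * blockDim.y + threadIdx.y;
    if (gx >= gw || gy >= gh) return;
    coarse[gy * gw + gx] = mapCoord((float)(gx * step), (float)(gy * step));
}

// Bilinear interpolation of the coarse map at output pixel (ox, oy).
__host__ __device__ inline float2 lerpCoarse(const float2* coarse, int gw,
                                             int step, int ox, int oy) {
    const int gx = ox / step, gy = oy / step;
    const float fx = (float)(ox - gx * step) / step;
    const float fy = (float)(oy - gy * step) / step;
    const float2 c00 = coarse[gy * gw + gx];
    const float2 c10 = coarse[gy * gw + gx + 1];
    const float2 c01 = coarse[(gy + 1) * gw + gx];
    const float2 c11 = coarse[(gy + 1) * gw + gx + 1];
    const float tx = c00.x + fx * (c10.x - c00.x);  // top edge, x component
    const float bx = c01.x + fx * (c11.x - c01.x);  // bottom edge, x component
    const float ty = c00.y + fx * (c10.y - c00.y);  // top edge, y component
    const float by = c01.y + fx * (c11.y - c01.y);  // bottom edge, y component
    return make_float2(tx + fy * (bx - tx), ty + fy * (by - ty));
}

// Pass 2: the warp itself, with the mapping replaced by the interpolated lookup.
// Note the y index is built from the .y components of the launch grid.
__global__ void warpCoarse(unsigned char* dst, cudaTextureObject_t src,
                           const float2* coarse, int gw, int step,
                           int isx, int isy, int osx, int osy) {
    const int ox = blockIdx.x * blockDim.x + threadIdx.x;
    const int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= osx || oy >= osy) return;
    const float2 p = lerpCoarse(coarse, gw, step, ox, oy);
    if (p.x >= 0.0f && p.y >= 0.0f && p.x < isx && p.y < isy) {
        dst[oy * osx + ox] = tex2D<unsigned char>(src, p.x, p.y);
    }
}
```

With step = 16 and a 2048x2048 output, the mapping would be evaluated at 129x129 = 16,641 nodes instead of 4,194,304 pixels. Whether the interpolation error is acceptable depends on how smooth the real funcx()/funcy() are, which would have to be checked against the full-resolution result.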

Pibben · January 19, 2011, 11:07am

Removing computations results in about 3ms runtime.

How do I calculate the memory throughput when reading from texture memory and storing to global? tex0_cache_sector_misses is 0, so that one I ruled out.

tera · January 19, 2011, 11:47am

So you really are compute bound. Try my suggestions above.

I’ve just used the worst case where every texture interpolation involves reading the four surrounding pixels. If even that can’t explain the time, the kernel must be compute bound.
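A worst-case estimate of this kind can be written out as follows. This is a reconstruction for illustration, not the poster's own numbers: it takes N = 100,000 pixels per launch and the 29 ms runtime from the first post, one byte per written pixel (dst is unsigned char), and assumes one byte per source texel, which the thread does not state.

```latex
\begin{align*}
B_{\text{worst}} &= N\,(4\,b_{\text{src}} + b_{\text{dst}})
  = 10^{5} \times (4 \times 1 + 1)\ \text{B}
  = 5 \times 10^{5}\ \text{B} \\
\frac{B_{\text{worst}}}{t} &= \frac{5 \times 10^{5}\ \text{B}}{29 \times 10^{-3}\ \text{s}}
  \approx 1.7 \times 10^{7}\ \text{B/s} \approx 17\ \text{MB/s}
\end{align*}
```

The GTX 480's peak memory bandwidth is on the order of 177 GB/s, so even this pessimistic traffic figure is roughly four orders of magnitude below what the card can deliver.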

Dittoaway · January 19, 2011, 4:31pm

Timing is tricky when you remove the calculations.
The compiler might be optimizing away most of the kernel.
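One common way around that problem is to keep the arithmetic and make the store depend on its result, so the compiler cannot discard the computation, and then time the kernel with CUDA events. The sketch below assumes a hypothetical expensive() in place of the real mapping; computeOnly() and timeComputeOnlyMs() are illustrative names, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the lengthy funcx()/funcy() math.
__device__ inline float expensive(int i) {
    return sinf(0.001f * i) * cosf(0.002f * i);
}

// Compute-only variant: the store is guarded by a comparison against a
// run-time value, so the compiler must keep the math, yet (almost) no
// memory traffic is generated when `magic` is a value that never occurs.
__global__ void computeOnly(float* sink, int n, float magic) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float v = expensive(i);
    if (v == magic) sink[i] = v;
}

// Times one launch of computeOnly() with CUDA events; returns milliseconds.
float timeComputeOnlyMs(int n) {
    float* sink = nullptr;
    cudaMalloc(&sink, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const float magic = -1.0e30f;  // expensive() stays within [-1, 1]

    computeOnly<<<blocks, threads>>>(sink, n, magic);  // warm-up launch

    cudaEventRecord(start);
    computeOnly<<<blocks, threads>>>(sink, n, magic);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(sink);
    return ms;
}
```

Comparing this compute-only time with the full kernel's time gives a more trustworthy split between arithmetic and memory cost than deleting the calculations outright.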

And is that really your code? ox=oy?

Simon_Green · January 19, 2011, 7:09pm

I agree using doubles is probably overkill, especially since the texture interpolation hardware only works at float precision. GTX480 has relatively slow double precision performance.
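To make the float-versus-double point concrete, here is a minimal sketch of the same mapping in both precisions. mapXd() and mapXf() are hypothetical examples, not the poster's funcx(). The float version uses f-suffixed literals and sinf()/cosf() so that nothing is silently promoted to double; with --use_fast_math, nvcc maps those calls to the faster, less accurate intrinsics.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical double-precision mapping, in the spirit of funcx().
__host__ __device__ inline double mapXd(int ox, int oy) {
    return 1000.0 * sin(0.001 * ox) * cos(0.0005 * oy);
}

// Same mapping in single precision: float literals and float math functions.
__host__ __device__ inline float mapXf(int ox, int oy) {
    return 1000.0f * sinf(0.001f * ox) * cosf(0.0005f * oy);
}

// Evaluates the float mapping for every output pixel; the y index comes
// from the .y components of the launch grid.
__global__ void evalFloat(float* outX, int osx, int osy) {
    const int ox = blockIdx.x * blockDim.x + threadIdx.x;
    const int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= osx || oy >= osy) return;
    outX[oy * osx + ox] = mapXf(ox, oy);
}
```

For coordinates in the range of a few thousand pixels, the difference between the two versions of this example stays far below one pixel. As far as I understand the CUDA programming guide, the texture unit's linear filtering weights carry only 8 fractional bits, so extra precision in the coordinates beyond that is unlikely to show up in the fetched value. The real funcx()/funcy() may be more sensitive, so the two precisions should be compared on actual data before switching.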
