Efficient matrix shift with wrapping (for panning large images, for example)

I’m trying to make a panning map kind of thing (like Google Maps), to learn memory management in CUDA. I have an algorithm in mind that involves shifting images around and wrapping them (so that the pixel at (y, -1) = (y, width - 1)). What would be an efficient way to do this? Right now I have three versions:

[codebox]
// Shift with normalized texture and wrapping
__global__ void copyShiftTexNorm(unsigned int *out, int pitch, int width, int height, int dx, int dy)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    out[pitch*y + x] = rgbaFloatToInt(tex2D(tex, (x + dx) / (float) width, (y + dy) / (float) height));
}

// Shift with unnormalized texture (which forces me to not have wrapping)
__global__ void copyShiftTex(unsigned int *out, int pitch, int width, int height, int dx, int dy)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int tx = (width + x + dx) % width;
    int ty = (height + y + dy) % height;
    out[pitch*y + x] = rgbaFloatToInt(tex2D(tex, tx, ty));
}

// Shift without textures
__global__ void copyShift(unsigned int *out, const unsigned int *in, int pitch, int width, int height, int dx, int dy)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int tx = (width + x + dx) % width;
    int ty = (height + y + dy) % height;
    out[pitch*y + x] = in[pitch*ty + tx];
}
[/codebox]

The first version doesn’t work: it kind of blurs out the image. I don’t know why; maybe it’s some sort of floating-point precision thing. However, it is vastly faster than the other two (about a 40% fps increase). The others are about equal, though the texture one gives slightly lower but more stable frame rates. Any ideas on how to make version 1 work, or how to make the others as efficient? Any help is much appreciated :)

I don’t think you want to use a texture for this copy unless you keep reusing it (you might get slightly less memory bandwidth through the texture cache than with direct global memory access). Also, with your texture version, you need to make sure you use wrap-around addressing (cudaAddressModeWrap).
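For reference, a sketch of what that setup might look like with the legacy texture-reference API that the kernels above use (assuming a float4 texture bound to a CUDA array; names are illustrative). Note two things that are easy to miss: cudaAddressModeWrap only takes effect with normalized coordinates, and the blur in version 1 may simply be linear filtering blending neighboring texels when samples land between texel centers — point filtering (or adding a half-texel offset to the coordinates) avoids that:

```cuda
// Hypothetical setup for the texture reference used by copyShiftTexNorm
texture<float4, 2, cudaReadModeElementType> tex;

void setupWrapTexture(cudaArray *array)
{
    tex.addressMode[0] = cudaAddressModeWrap;  // wrap in x
    tex.addressMode[1] = cudaAddressModeWrap;  // wrap in y
    tex.filterMode     = cudaFilterModePoint;  // no blending of adjacent texels
    tex.normalized     = true;                 // wrap mode requires normalized coords
    cudaBindTextureToArray(tex, array);
}
```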

I’ve recently implemented Matlab’s circshift() function in C, and I know it can be done with four simple 2D memory copies:

Visualize the image as four blocks. The intersection of the four blocks, [cx, cy] = Wrap([shiftX, shiftY], width, height), is the location of the new, transformed origin.

before:

|-----|-----|
|  2  |  3  |
|-----|-----|
|  0  |  1  |
|-----|-----|

after the copies:

|-----|-----|
|  1  |  0  |
|-----|-----|
|  3  |  2  |
|-----|-----|

I leave finding the 2D addressing calculations to you.
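For concreteness, here is one way the four-copy idea could look, sketched in plain host-side C with a row-by-row memcpy standing in for cudaMemcpy2D (function names are illustrative, not from the thread):

```c
#include <string.h>

/* Copy a w x h block from (sx, sy) in src to (dx, dy) in dst.
 * Both images are `width` pixels wide; one memcpy per row stands in
 * for a single cudaMemcpy2D call on the device. */
static void copy_block(unsigned *dst, int dx, int dy,
                       const unsigned *src, int sx, int sy,
                       int w, int h, int width)
{
    for (int row = 0; row < h; ++row)
        memcpy(dst + (dy + row) * width + dx,
               src + (sy + row) * width + sx,
               (size_t)w * sizeof *dst);
}

/* Circular shift of a width x height image by (shiftX, shiftY),
 * implemented as the four block copies described above. */
void circshift2d(unsigned *out, const unsigned *in,
                 int width, int height, int shiftX, int shiftY)
{
    /* Wrapped origin: works for negative shifts too. */
    int cx = ((shiftX % width) + width) % width;
    int cy = ((shiftY % height) + height) % height;

    copy_block(out, cx, cy, in, 0, 0, width - cx, height - cy, width);
    copy_block(out, 0,  cy, in, width - cx, 0, cx, height - cy, width);
    copy_block(out, cx, 0,  in, 0, height - cy, width - cx, cy, width);
    copy_block(out, 0,  0,  in, width - cx, height - cy, cx, cy, width);
}
```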

Oh good point, I was being dumb :) Still funky that the textures don’t work with normalized coordinates though.

Did you implement the shift as four cudaMemcpy2Ds like I said? I said you probably shouldn’t use textures here because there seems to be no data reuse, and using the texture cache pays a premium of some extra memory loads (maybe).

Yeah I did it with four copies instead. Much easier :)

Could either of you explain how circshift can be achieved with cudaMemcpy2D? I am familiar with the procedure carried out by circshift but can’t immediately see how it would be implemented in this way.

This should be a simple problem solving exercise. When in doubt, just solve a simpler instance:

Just think of how to do it for the case when width and height are both even, and shiftX = width / 2, shiftY = height / 2. Very simple. Then generalize from there.
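In that even-size, half-shift case, the four copies reduce to swapping diagonally opposite quadrants. A minimal plain-C sketch of just that instance (illustrative names, one memcpy per half-row where a device version would use cudaMemcpy2D):

```c
#include <string.h>

/* Circular shift of an even width x height image by exactly
 * (width/2, height/2): each quadrant moves to the diagonally
 * opposite position, so four fixed block copies suffice. */
void halfshift(unsigned *out, const unsigned *in, int width, int height)
{
    int hw = width / 2, hh = height / 2;
    for (int row = 0; row < hh; ++row) {
        /* top-left -> bottom-right, bottom-right -> top-left */
        memcpy(out + (row + hh) * width + hw, in + row * width,
               (size_t)hw * sizeof *in);
        memcpy(out + row * width, in + (row + hh) * width + hw,
               (size_t)hw * sizeof *in);
        /* top-right -> bottom-left, bottom-left -> top-right */
        memcpy(out + (row + hh) * width, in + row * width + hw,
               (size_t)hw * sizeof *in);
        memcpy(out + row * width + hw, in + (row + hh) * width,
               (size_t)hw * sizeof *in);
    }
}
```

Generalizing from here just means replacing the fixed (width/2, height/2) corner with the wrapped shift origin, which changes the four block sizes but not the pattern.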