I’m trying to make a panning map (like Google Maps) as a way to learn memory management in CUDA. My idea for the algorithm involves shifting images and wrapping them around at the edges, so that the pixel at (y, -1) is the pixel at (y, width - 1). What would be an efficient way to do this? Right now I have three versions:

[codebox]

```
// Shift with normalized texture coordinates and wrap addressing
__global__ void copyShiftTexNorm(unsigned int *out, int pitch, int width, int height, int dx, int dy)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // skip threads outside the image

    // Wrapping is left to the texture unit (wrap address mode)
    out[pitch*y + x] = rgbaFloatToInt(tex2D(tex, (x + dx) / (float) width, (y + dy) / (float) height));
}
```

```
// Shift with unnormalized texture coordinates (no wrap addressing, so wrap manually)
__global__ void copyShiftTex(unsigned int *out, int pitch, int width, int height, int dx, int dy)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // skip threads outside the image

    // Double modulo keeps the index non-negative for any dx, dy
    int tx = ((x + dx) % width + width) % width;
    int ty = ((y + dy) % height + height) % height;
    out[pitch*y + x] = rgbaFloatToInt(tex2D(tex, tx, ty));
}
```

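```
// Shift without textures (plain global memory reads)
__global__ void copyShift(unsigned int *out, const unsigned int *in, int pitch, int width, int height, int dx, int dy)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // skip threads outside the image

    // Double modulo keeps the index non-negative for any dx, dy
    int tx = ((x + dx) % width + width) % width;
    int ty = ((y + dy) % height + height) % height;
    out[pitch*y + x] = in[pitch*ty + tx];
}
```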

[/codebox]

The first version doesn’t work: it blurs the image. I don’t know why; maybe it’s some sort of floating-point precision issue. However, it is much faster than the other two (about a 40% fps increase). The other two are about equal, although the texture version gives slightly lower but more stable frame rates. Any idea how to make version 1 work, or how to make the others as efficient? Any help is much appreciated :)
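One possible explanation for the blur, offered as a guess rather than a confirmed diagnosis: if `tex` is set up with linear filtering (an assumption, since the texture setup is not shown here), then a coordinate of `(x + dx) / (float) width` falls on the edge between two texels, and the filter returns their average. The sketch below emulates the 1D linear-filtering formula from the CUDA Programming Guide on the CPU, so it runs without a GPU. The function names `fetchLinearWrap`, `sampleAtEdge`, and `sampleAtCenter` are hypothetical and exist only for this illustration.

```cpp
#include <cmath>
#include <vector>

// CPU emulation of a 1D, linearly filtered, wrap-addressed texture fetch with
// normalized coordinates, following the CUDA Programming Guide formula:
//   x = N * u,  xB = x - 0.5,  i = floor(xB),  alpha = frac(xB),
//   result = (1 - alpha) * T[i] + alpha * T[i + 1]
// Assumption: the hardware keeps alpha in fixed point; exact floats are used here.
float fetchLinearWrap(const std::vector<float>& texels, float u)
{
    const int n = static_cast<int>(texels.size());
    u -= std::floor(u);                        // wrap mode keeps the fractional part
    const float xB = u * n - 0.5f;
    const float fl = std::floor(xB);
    const float alpha = xB - fl;               // blend weight between the two texels
    const int i0 = ((static_cast<int>(fl) % n) + n) % n;
    const int i1 = (i0 + 1) % n;
    return (1.0f - alpha) * texels[i0] + alpha * texels[i1];
}

// Coordinate as computed in copyShiftTexNorm: lands on a texel edge.
float sampleAtEdge(const std::vector<float>& texels, int x, int dx)
{
    return fetchLinearWrap(texels, (x + dx) / static_cast<float>(texels.size()));
}

// Same coordinate moved to the texel center by a half-texel offset.
float sampleAtCenter(const std::vector<float>& texels, int x, int dx)
{
    return fetchLinearWrap(texels, (x + dx + 0.5f) / static_cast<float>(texels.size()));
}
```

With texels `{0, 10, 20, 30}`, `sampleAtEdge(texels, 1, 0)` blends texels 0 and 1 equally, while `sampleAtCenter(texels, 1, 0)` returns texel 1 unchanged. If this is indeed the cause, the corresponding change in the kernel would be to use `(x + dx + 0.5f) / (float) width` and `(y + dy + 0.5f) / (float) height`; switching the texture to point filtering would be another option to try.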