Shifting a 2D array to the left doesn't work

Hello,

I am trying to write a simple kernel that shifts a 2D array to the left with wrap-around, copying the elements that fall off the left edge to the far right (periodic boundary conditions). The shift of the whole array to the left works; only the wrap-around copy is the problem. My code is below. Note that readimage and writeimage just read and write the 1024x1024 array, here a black-and-white image of me, but the data could be anything. I have attached the output image, which shows a double copy of what was on the left as well as a copy of the original image.

grayme.pgm (1015.0 KB)

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cuda_runtime.h>

#define bsx 32
#define bsy 32
#define bs (bsx*bsy)

__global__ void cushift(double *phi_d, double *temp_d, int shift, int nx, int ny) {

int gindex_x = threadIdx.x + blockIdx.x * blockDim.x;
int gindex_y = threadIdx.y + blockIdx.y * blockDim.y;
int lindex_x = threadIdx.x;
int lindex_y = threadIdx.y;

int gindex = gindex_x + bs*gindex_y;
int lindex = lindex_x + bsx*lindex_y;

if((gindex_x < shift) && (gindex_y < bs)){
    temp_d[gindex_x + bs*gindex_y] = phi_d[(gindex_x) + bs*gindex_y];
}

__syncthreads();

if((gindex_x < bs) && (gindex_y < bs) && (gindex_x >= shift)){
    phi_d[(gindex_x - shift) + bs*gindex_y] = phi_d[(gindex_x) + bs*gindex_y];
}

 __syncthreads();

if((gindex_x < shift) && (gindex_y < bs)){
    phi_d[(bs + gindex_x - shift) + bs*gindex_y] = temp_d[gindex_x + bs*gindex_y];
}

}

int main(){

int nx = 1024;
int ny = 1024;
double *f = (double *) malloc(sizeof ( double) * nx*ny);
double *f_d, *temp_d;

cudaMalloc ((void**)&f_d, sizeof(double)*nx*ny);
cudaMalloc ((void**)&temp_d, sizeof(double)*nx*ny);

// Fill the host array before copying it to the device.
readimage(f, nx, ny);

cudaMemcpy(f_d, f, sizeof(double)*nx*ny, cudaMemcpyHostToDevice);

dim3 dimGrid (int((nx-0.5)/bsx) + 1, int((ny-0.5)/bsy) + 1);
dim3 dimBlock (bsx, bsy);

cushift<<<dimGrid, dimBlock>>>(f_d,temp_d,100,nx,ny);

cudaMemcpy(f, f_d, sizeof(double)*nx*ny, cudaMemcpyDeviceToHost);

writeimage(f, nx, ny);

}
fftshift2D.pgm (1.0 MB)

Is there a specific reason to physically copy data instead of simply adding an offset (followed by wrap-around) into the addressing computation? Given that the relevant dimension is a power of two, the wrap around would use an AND which is cheap.
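
To make the suggestion concrete, here is a minimal sketch of offset addressing with an AND-based wrap-around. The names wrap_column, cushift_wrap and launch_cushift_wrap are hypothetical and not from the original post, and the sketch assumes nx is a power of two and shift is non-negative. It writes into a separate output buffer rather than shifting in place, because __syncthreads() only synchronizes the threads within one block, so an in-place shift can let one block overwrite elements that another block has not read yet.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): source column for output
// column x under a periodic left shift. Assumes nx is a power of two and
// shift >= 0, so the wrap-around reduces to a bitwise AND.
__host__ __device__ inline int wrap_column(int x, int shift, int nx) {
    return (x + shift) & (nx - 1);
}

// Out-of-place periodic left shift. Each thread writes exactly one output
// element and only reads from in_d, so no synchronization between threads
// or blocks is needed.
__global__ void cushift_wrap(const double *in_d, double *out_d,
                             int shift, int nx, int ny) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < nx && y < ny) {
        out_d[x + nx * y] = in_d[wrap_column(x, shift, nx) + nx * y];
    }
}

// Hypothetical launcher: covers the nx-by-ny array with 32x32 blocks and
// returns any launch error reported by the runtime.
cudaError_t launch_cushift_wrap(const double *in_d, double *out_d,
                                int shift, int nx, int ny) {
    dim3 block(32, 32);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    cushift_wrap<<<grid, block>>>(in_d, out_d, shift, nx, ny);
    return cudaGetLastError();
}
```

With the buffers from the posted main, this could be called as launch_cushift_wrap(f_d, temp_d, 100, nx, ny), after which the shifted image would be copied back from temp_d instead of f_d. If the shifted copy is not needed at all, the same wrap_column computation could instead be applied wherever the array is read.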

No specific reason to do it the way I have. I hadn’t really thought about doing it any other way.

I asked you here to properly format code you post on this forum, and explained one way to do that.