[font=“Arial”]

I have an array of 800*4096 elements. My original C program has a three-level nested loop:

```
for (i = 0; i < 800; ++i)
{
    for (j = 0; j < 64; ++j)
    {
        for (k = 0; k < 64; ++k)
        {
            x1 = (i - (31 * (int)(i / 31))) * 15 + k + 2048;
            y1 = (i / 31) * 15 + j + 2048;
            index = (y1 - (2048 * (int)(y1 / 2048))) * 2048 + x1 - (2048 * (int)(x1 / 2048));
            out1[i * 4096 + j * 64 + k] = in1[index];
        }
    }
}
```
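For reference, the `a - n*(int)(a/n)` pattern in the loop is the remainder `a % n` when `a` is a non-negative `int`, which holds for every value the loop produces. The sketch below puts both forms side by side and counts any disagreement over the full loop range. The function names are made up for illustration and are not part of the original program.

```c
#include <assert.h>

/* Source index exactly as written in the loop above (truncating casts). */
static int src_index_original(int i, int j, int k)
{
    int x1 = (i - (31 * (int)(i / 31))) * 15 + k + 2048;
    int y1 = (i / 31) * 15 + j + 2048;
    return (y1 - (2048 * (int)(y1 / 2048))) * 2048 + x1 - (2048 * (int)(x1 / 2048));
}

/* Same index written with %, valid because every operand is non-negative. */
static int src_index_mod(int i, int j, int k)
{
    assert(i >= 0 && i < 800 && j >= 0 && j < 64 && k >= 0 && k < 64);
    int x1 = (i % 31) * 15 + k + 2048;   /* column before wrap-around */
    int y1 = (i / 31) * 15 + j + 2048;   /* row before wrap-around */
    return (y1 % 2048) * 2048 + (x1 % 2048);
}

/* Counts the (i, j, k) triples for which the two forms disagree. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int i = 0; i < 800; ++i)
        for (int j = 0; j < 64; ++j)
            for (int k = 0; k < 64; ++k)
                if (src_index_original(i, j, k) != src_index_mod(i, j, k))
                    ++bad;
    return bad;
}
```

Read this way, `i % 31` and `i / 31` select a column and row in a grid of 64x64 tiles spaced 15 elements apart, and `j`, `k` walk inside one tile of a 2048-wide source.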

CUDA:

I am confused about how to break this into blocks.

I tried launching 64 blocks of 64 threads each, with every thread looping over the 800 values of i:

```
__shared__ int x[64];
__shared__ int y[64];
__shared__ int indx[64];
int val = 0;
int tid = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = 0; i < 800; ++i)
{
    if (threadIdx.x < 64)
    {
        x[threadIdx.x] = (i - (31 * (int)(i / 31))) * 15 + threadIdx.x + 2048;
        y[threadIdx.x] = (i / 31) * 15 + blockIdx.x + 2048;
        indx[threadIdx.x] = (y[threadIdx.x] - (2048 * (int)(y[threadIdx.x] / 2048))) * 2048 + (x[threadIdx.x] - (2048 * (int)(x[threadIdx.x] / 2048)));
        val = indx[threadIdx.x];
    }
    __syncthreads();
}
out[tid] = in[val];
```
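Note that in the attempt above the store to `out[tid]` sits after the loop, so each thread writes only the value computed for i = 799, and `tid` covers only 64*64 = 4096 of the 800*4096 outputs. Since every output element depends only on its own (i, j, k), one possible decomposition is one thread per output element with no loop over i at all. The sketch below emulates that mapping in plain C so it can be checked on the host. The `int` element type, the names `gather_one` and `emulate_launch`, and the launch shape of 800*64 blocks of 64 threads are assumptions of this sketch, not something taken from the original program.

```c
#include <assert.h>

enum { TILES = 800, TILE_H = 64, TILE_W = 64, SRC_W = 2048 };

/* Work that a single CUDA thread would do. Here tid stands in for
   blockIdx.x * blockDim.x + threadIdx.x (an assumption of this sketch). */
static void gather_one(int tid, const int *in, int *out)
{
    assert(tid >= 0 && tid < TILES * TILE_H * TILE_W);
    int i = tid / (TILE_H * TILE_W);   /* which 64x64 tile (old loop variable i) */
    int j = (tid / TILE_W) % TILE_H;   /* row inside the tile (old j) */
    int k = tid % TILE_W;              /* column inside the tile (old k) */
    int x1 = (i % 31) * 15 + k + 2048; /* % equals the a - n*(int)(a/n) form here */
    int y1 = (i / 31) * 15 + j + 2048;
    out[tid] = in[(y1 % SRC_W) * SRC_W + (x1 % SRC_W)];
}

/* Host-side stand-in for launching `blocks` blocks of `threads` threads. */
static void emulate_launch(int blocks, int threads, const int *in, int *out)
{
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            gather_one(b * threads + t, in, out);
}
```

In an actual kernel the body of `gather_one` would become the `__global__` function, with `tid` computed from `blockIdx.x`, `blockDim.x` and `threadIdx.x`. Nothing is shared between threads in this mapping, so the `__shared__` arrays and the `__syncthreads()` call would not be needed.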

Any suggestions?

[/font]