Automatic split of a 512x512 matrix into multiple 8x8 matrices

Hi,

I have a 512x512 matrix (a gray-scale image) that is stored as a 1-D array on the GPU (i.e., 1x262144 elements).

a) How can I extract the first 8x8 block of elements from this 1-D array and store it in another 1-D array of size 1x64?
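To make (a) concrete, here is a minimal sketch of the copy I have in mind. It assumes the image is stored row-major as `float`; the names `extractTile` and `firstTile` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int N = 512;  // image width and height (from the question)
constexpr int B = 8;    // tile edge length

// Copies the BxB tile whose top-left corner is (row0, col0) of a row-major
// NxN image into a contiguous array of B*B elements.
// Intended launch: one block of B*B = 64 threads, one thread per element.
__global__ void extractTile(const float* img, float* tile, int row0, int col0)
{
    int t = threadIdx.x;
    if (t >= B * B) return;
    int r = t / B;  // row inside the tile
    int c = t % B;  // column inside the tile
    tile[t] = img[(row0 + r) * N + (col0 + c)];
}

// Hypothetical host helper: uploads the image, extracts the first tile,
// and returns it as a 64-element host vector.
std::vector<float> firstTile(const std::vector<float>& h_img)
{
    assert(h_img.size() == static_cast<size_t>(N) * N);
    float* d_img = nullptr;
    float* d_tile = nullptr;
    cudaMalloc(&d_img, N * N * sizeof(float));
    cudaMalloc(&d_tile, B * B * sizeof(float));
    cudaMemcpy(d_img, h_img.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

    extractTile<<<1, B * B>>>(d_img, d_tile, 0, 0);

    std::vector<float> h_tile(B * B);
    cudaMemcpy(h_tile.data(), d_tile, B * B * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaFree(d_tile);
    return h_tile;
}
```

Element `t` of the 1x64 output corresponds to row `t / 8` and column `t % 8` of the tile, so the first 8x8 block is not the first 64 elements of the 1-D array but 8 runs of 8 elements, each 512 elements apart.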

b) Next, I want to automatically split the 512x512 matrix into multiple 8x8 matrices. How can I do this using CUDA?
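For (b), one possible approach is sketched below: launch a 64x64 grid of 8x8 thread blocks so that each CUDA block copies one 8x8 tile into a tile-major output buffer. This assumes row-major `float` storage; `splitIntoTiles` and `splitImage` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int N = 512;  // image width and height (from the question)
constexpr int B = 8;    // tile edge length

// Each thread block handles one BxB tile; each thread copies one pixel.
// Output layout: tile k occupies tiles[k*64 .. k*64+63], with tiles numbered
// row by row across the image (k = tileRow * 64 + tileCol).
__global__ void splitIntoTiles(const float* img, float* tiles)
{
    int tileRow = blockIdx.y;
    int tileCol = blockIdx.x;
    int r = threadIdx.y;  // row inside the tile
    int c = threadIdx.x;  // column inside the tile
    int tileId = tileRow * (N / B) + tileCol;
    tiles[tileId * B * B + r * B + c] =
        img[(tileRow * B + r) * N + (tileCol * B + c)];
}

// Hypothetical host helper: returns all 4096 tiles back to back.
std::vector<float> splitImage(const std::vector<float>& h_img)
{
    assert(h_img.size() == static_cast<size_t>(N) * N);
    float* d_img = nullptr;
    float* d_tiles = nullptr;
    cudaMalloc(&d_img, N * N * sizeof(float));
    cudaMalloc(&d_tiles, N * N * sizeof(float));
    cudaMemcpy(d_img, h_img.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(N / B, N / B);  // 64 x 64 tiles
    dim3 block(B, B);         // 8 x 8 threads per tile
    splitIntoTiles<<<grid, block>>>(d_img, d_tiles);

    std::vector<float> h_tiles(N * N);
    cudaMemcpy(h_tiles.data(), d_tiles, N * N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaFree(d_tiles);
    return h_tiles;
}
```

Depending on the later computation, the explicit copy may not even be needed: a kernel can read its tile directly from the original image using the same index expression.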

c) Computation on a single 8x8 matrix. There are 3 main steps:

i) Assume ‘A’ is an 8x8 matrix; first compute A*A.
To do this I have used a kernel: K1<<<1,64>>>(d_A_square, d_A);
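For context, here is a sketch of how a kernel with this launch configuration could look. It assumes that A*A means the matrix product (not the element-wise square), that A is stored row-major as `float`, and that each of the 64 threads computes one output element. This is an illustration, not necessarily the exact kernel I use; `squareTile` is a hypothetical host helper.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int B = 8;  // tile edge length

// One thread per output element: thread t computes entry (t / 8, t % 8)
// of the matrix product A * A. Intended launch: K1<<<1, 64>>>(...).
__global__ void K1(float* A_square, const float* A)
{
    int t = threadIdx.x;
    if (t >= B * B) return;
    int r = t / B;
    int c = t % B;
    float sum = 0.0f;
    for (int k = 0; k < B; ++k)
        sum += A[r * B + k] * A[k * B + c];
    A_square[t] = sum;
}

// Hypothetical host helper: squares one 8x8 matrix on the GPU.
std::vector<float> squareTile(const std::vector<float>& h_A)
{
    assert(h_A.size() == static_cast<size_t>(B) * B);
    float* d_A = nullptr;
    float* d_A_square = nullptr;
    cudaMalloc(&d_A, B * B * sizeof(float));
    cudaMalloc(&d_A_square, B * B * sizeof(float));
    cudaMemcpy(d_A, h_A.data(), B * B * sizeof(float), cudaMemcpyHostToDevice);

    K1<<<1, B * B>>>(d_A_square, d_A);

    std::vector<float> h_out(B * B);
    cudaMemcpy(h_out.data(), d_A_square, B * B * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_A_square);
    return h_out;
}
```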

ii) Retrieve the row and column indices that ‘A’ occupies in the input 512x512 image (i.e., row 1 to row 8 and column 1 to column 8 for the first block). How can I retrieve this index information?
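What I am after in (ii) can be sketched as plain index arithmetic. The sketch assumes the tiles are numbered row by row (tile 0 is the top-left block, tile 63 the top-right) and uses 0-based indices, so adding 1 gives the 1-based rows and columns mentioned above. `TileBounds`, `tileBounds`, `fillBounds` and `allTileBounds` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int N = 512;  // image width and height (from the question)
constexpr int B = 8;    // tile edge length

// Inclusive, 0-based row and column range of one tile in the full image.
struct TileBounds {
    int rowFirst, rowLast;
    int colFirst, colLast;
};

// Maps a tile number (0 .. 4095, counted row by row) to its image coordinates.
// Usable from both host and device code.
__host__ __device__ inline TileBounds tileBounds(int tileId)
{
    int tileRow = tileId / (N / B);
    int tileCol = tileId % (N / B);
    TileBounds b;
    b.rowFirst = tileRow * B;
    b.rowLast  = tileRow * B + B - 1;
    b.colFirst = tileCol * B;
    b.colLast  = tileCol * B + B - 1;
    return b;
}

// One thread per tile: records the bounds of every tile on the device.
__global__ void fillBounds(TileBounds* out, int numTiles)
{
    int tileId = blockIdx.x * blockDim.x + threadIdx.x;
    if (tileId < numTiles) out[tileId] = tileBounds(tileId);
}

// Hypothetical host helper: returns the bounds of all tiles.
std::vector<TileBounds> allTileBounds()
{
    const int numTiles = (N / B) * (N / B);
    TileBounds* d_out = nullptr;
    cudaMalloc(&d_out, numTiles * sizeof(TileBounds));
    fillBounds<<<(numTiles + 255) / 256, 256>>>(d_out, numTiles);
    std::vector<TileBounds> h_out(numTiles);
    cudaMemcpy(h_out.data(), d_out, numTiles * sizeof(TileBounds), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

If each tile is processed by its own thread block in a 64x64 grid, the same information is available without a lookup: `blockIdx.y * 8` and `blockIdx.x * 8` are the first row and first column of the tile.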

Please clarify.
Thanks in advance