Example of Matrix multiplication

Hi all,

I have a question about data tiling in the matrix multiplication example from the SDK.
The code that copies tiles from device memory to shared memory is as follows:

//code from SDK
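for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    // Shared-memory tiles of A and B for this thread block
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; // BLOCK_SIZE = 3
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Each thread copies one element of each tile from device memory
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Wait until both tiles are fully loaded
    __syncthreads();
    ....
}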

In the example, the same amount of tile data is copied from both matrices, A and B. Now I want to copy, for example, a 3x3 block from A but a 5x5 block from B each time. Both tiles have the same centre.
How should I use the threads now? I'm a bit confused by the thread IDs…

This is a figure about the question:

I think maybe I need two loops with different “begin” and “end” values, like:

for (int a = aBegin; a <= aEnd; a += aStep) {
    __shared__ float As[BLOCK_SIZE_A][BLOCK_SIZE_A]; // BLOCK_SIZE_A = 3

    AS(ty, tx) = A[a + wA * ty + tx];

    __syncthreads();
    ....
}

for (int b = bBegin; b <= bEnd; b += bStep) {
    __shared__ float Bs[BLOCK_SIZE_B][BLOCK_SIZE_B]; // BLOCK_SIZE_B = 5

    BS(ty, tx) = B[b + wB * ty + tx];

    __syncthreads();
    ....
}

Is that right?
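
For comparison, here is a minimal sketch of one alternative to two separate loops. It keeps a 3x3 thread block and lets the 9 threads share the 25 elements of the larger tile through a strided loop over a linear thread index. Everything here is an assumption made for illustration and is not taken from the SDK: the names TILE_A, TILE_B, HALO, loadB, tileBElement, simulateTileBLoad and loadTiles are hypothetical, A and B are assumed to have the same height and width, elements outside B are assumed to read as zero, and the final multiplication is only a placeholder. The guards at the top let the index helper compile as plain C++ as well as under nvcc.

```cpp
#include <cassert>
#include <vector>

#ifndef __CUDACC__   // lets the helpers below also build as plain host C++
#define __host__
#define __device__
#endif

// Hypothetical sizes from the question: 3x3 tile of A, 5x5 tile of B.
constexpr int TILE_A = 3;
constexpr int TILE_B = 5;
constexpr int HALO = (TILE_B - TILE_A) / 2;   // 1 extra element on each side

// Read B(row, col); positions outside the matrix read as 0 (assumed policy).
__host__ __device__ inline float loadB(const float* B, int hB, int wB,
                                       int row, int col) {
    return (row >= 0 && row < hB && col >= 0 && col < wB)
               ? B[row * wB + col] : 0.0f;
}

// Element i (row-major, 0..24) of the 5x5 tile that shares its centre with
// the 3x3 tile whose top-left corner is (row0, col0).
__host__ __device__ inline float tileBElement(const float* B, int hB, int wB,
                                              int row0, int col0, int i) {
    int r = i / TILE_B, c = i % TILE_B;
    return loadB(B, hB, wB, row0 - HALO + r, col0 - HALO + c);
}

// Host-side check of the strided loop: runs the 9 "threads" one by one.
inline std::vector<float> simulateTileBLoad(const float* B, int hB, int wB,
                                            int row0, int col0) {
    assert(B != nullptr);
    std::vector<float> Bs(TILE_B * TILE_B, -1.0f);
    for (int tid = 0; tid < TILE_A * TILE_A; ++tid)
        for (int i = tid; i < TILE_B * TILE_B; i += TILE_A * TILE_A)
            Bs[i] = tileBElement(B, hB, wB, row0, col0, i);
    return Bs;
}

#ifdef __CUDACC__
// Launch with dim3 block(TILE_A, TILE_A) and one block per 3x3 tile of A.
// Assumes A, B and out are all h x w, with h and w multiples of TILE_A.
__global__ void loadTiles(const float* A, const float* B, float* out,
                          int h, int w) {
    __shared__ float As[TILE_A][TILE_A];
    __shared__ float Bs[TILE_B][TILE_B];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row0 = blockIdx.y * TILE_A;   // top-left corner of the A tile
    int col0 = blockIdx.x * TILE_A;

    // One element of A per thread, as in the SDK sample.
    As[ty][tx] = A[(row0 + ty) * w + (col0 + tx)];

    // 9 threads cover 25 elements: thread t loads elements t, t+9, t+18.
    int tid = ty * TILE_A + tx;
    for (int i = tid; i < TILE_B * TILE_B; i += TILE_A * TILE_A)
        Bs[i / TILE_B][i % TILE_B] = tileBElement(B, h, w, row0, col0, i);

    __syncthreads();   // both tiles are complete before any thread reads them

    // Placeholder computation so the kernel produces a visible result.
    out[(row0 + ty) * w + (col0 + tx)] = As[ty][tx] * Bs[ty + HALO][tx + HALO];
}
#endif
```

In this sketch both tiles are filled inside the same iteration and followed by a single __syncthreads(), so they are available together for the computation. With two separate loops, a shared array declared inside the first loop body would not be in scope in the second one, so both declarations would probably have to move outside the loops if that route is taken.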