Help regarding matrix multiplication using pitch

Hello,
I am trying to do matrix multiplication on a 2D array using pitch.
I am able to load the 2D array onto the GPU using the cudaMallocPitch() and cudaMemcpy2D() functions, but I am not able to write the multiplication code.
The output I am getting is wrong.
Can anyone help me out with the code?

Here's the code I have written:

//--- code for matrix multiplication using pitch ---

float Pvalue = 0;
xid = blockIdx.x * blockDim.x + threadIdx.x;
yid = blockIdx.y * blockDim.y + threadIdx.y;

for (int k = 0; k < N; ++k) { // D = T*M
    float Melement = T[yid*pitch+k];
    float Nelement = M[k+xid*pitch];
    Pvalue += Melement * Nelement;
}
D[yid*pitch+xid] = Pvalue;
__syncthreads();

//---------

I am waiting for your help.
Thanks in advance.

I guess that you want to compute

D(yid, xid) = sum over k of T(yid, k) * M(k, xid)

If you use "T[yid*pitch+k]" to represent T(yid, k), then why do you use "M[k+xid*pitch]" to represent M(k, xid) instead of "M[k*pitch+xid]"?
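To make the indexing concrete, here is a minimal sketch of the D = T*M kernel together with a host wrapper. It is not the original poster's code: the names matMulPitched and multiplyOnGpu are made up for illustration, the matrices are assumed to be square N x N row-major floats, and the kernel takes the row stride counted in floats (the byte pitch from cudaMallocPitch() divided by sizeof(float)). All three matrices are carved out of one pitched allocation so that they are guaranteed to share the same pitch.

```cuda
#include <cuda_runtime.h>

// Computes D = T * M for N x N row-major float matrices.
// pitchElems is the row stride counted in floats (not bytes).
__global__ void matMulPitched(const float* T, const float* M, float* D,
                              int N, size_t pitchElems)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;  // column of D
    int yid = blockIdx.y * blockDim.y + threadIdx.y;  // row of D
    if (xid >= N || yid >= N) return;                 // extra threads do nothing

    float Pvalue = 0.0f;
    for (int k = 0; k < N; ++k) {
        // T(yid, k) * M(k, xid): row index times stride, plus column index
        Pvalue += T[yid * pitchElems + k] * M[k * pitchElems + xid];
    }
    D[yid * pitchElems + xid] = Pvalue;
}

// Hypothetical host wrapper: hT, hM and hD are dense N*N float arrays.
// Returns the first CUDA error seen by the checked calls, or cudaSuccess.
cudaError_t multiplyOnGpu(const float* hT, const float* hM, float* hD, int N)
{
    char*  base  = nullptr;
    size_t pitch = 0;                              // row stride in bytes
    const size_t rowBytes = N * sizeof(float);

    // One allocation of 3*N rows, so T, M and D share a single pitch.
    cudaError_t err = cudaMallocPitch((void**)&base, &pitch, rowBytes,
                                      3 * (size_t)N);
    if (err != cudaSuccess) return err;

    float* dT = (float*)base;
    float* dM = (float*)(base + (size_t)N * pitch);
    float* dD = (float*)(base + 2 * (size_t)N * pitch);

    cudaMemcpy2D(dT, pitch, hT, rowBytes, rowBytes, N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(dM, pitch, hM, rowBytes, rowBytes, N, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    // Assumption: the byte pitch is a multiple of sizeof(float).
    matMulPitched<<<grid, block>>>(dT, dM, dD, N, pitch / sizeof(float));

    // Issued to the default stream, so it runs after the kernel has finished.
    err = cudaMemcpy2D(hD, rowBytes, dD, pitch, rowBytes, N,
                       cudaMemcpyDeviceToHost);
    cudaFree(base);
    return err;
}
```

The bounds check at the top of the kernel matters whenever N is not a multiple of the block size, since the grid is rounded up and some threads fall outside the matrix.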

@LSChien:

Thanks for replying.

I coded it as you said and I am still getting the wrong answer.

Actually, I want to multiply three matrices as D = T*M*T', where T' is the transpose of T.

Please help me.

After the change, the code looks like this:

int xid = blockIdx.x * blockDim.x + threadIdx.x;
int yid = blockIdx.y * blockDim.y + threadIdx.y;

if (xid<N && yid<N)
    T_[xid*pitch+yid] = T[yid*pitch+xid];
__syncthreads();

float Pvalue=0;
xid = blockIdx.x * blockDim.x + threadIdx.x;
yid = blockIdx.y * blockDim.y + threadIdx.y;

for (int k = 0; k < N; ++k) { //D=T*M
    float Melement = T[yid*pitch+k];
    float Nelement = M[k*pitch+xid];
    Pvalue += Melement * Nelement;
}
D[yid*pitch+xid] = Pvalue;
__syncthreads();

xid = blockIdx.x * blockDim.x + threadIdx.x;
yid = blockIdx.y * blockDim.y + threadIdx.y;

for (int k = 0; k < N; ++k) { //D=T*M
    float Melement = T_[yid*pitch+k];
    float Nelement = D[k*pitch+xid];
    Pvalue += Melement * Nelement;
}
M[yid*pitch+xid] = Pvalue;
__syncthreads();

Please tell me where to make corrections.
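A few things in the code above may be worth checking. __syncthreads() only synchronizes the threads of one block, so it cannot guarantee that every element of T_ or D has been written before a thread in another block reads it. Pvalue is also not reset before the second loop, and the second loop as written computes T_*D, which is T'*(T*M) rather than (T*M)*T'. Below is a hedged sketch of one way around all three: two separate kernel launches, with the transpose folded into the indexing so no T_ buffer is needed. The names matMulKernel and tripleProduct are hypothetical, and the layout (square N x N row-major floats, one shared pitched allocation) is an assumption.

```cuda
#include <cuda_runtime.h>

// C = A * B, or C = A * B^T when transposeB is true.
// All matrices are N x N, row-major, with a row stride of ld floats.
__global__ void matMulKernel(const float* A, const float* B, float* C,
                             int N, size_t ld, bool transposeB)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int yid = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    if (xid >= N || yid >= N) return;

    float acc = 0.0f;  // fresh accumulator in every launch
    for (int k = 0; k < N; ++k) {
        // B^T(k, xid) is B(xid, k), so the transpose is just swapped indices.
        float b = transposeB ? B[xid * ld + k] : B[k * ld + xid];
        acc += A[yid * ld + k] * b;
    }
    C[yid * ld + xid] = acc;
}

// Hypothetical host helper: hD = hT * hM * hT^T for dense N*N float arrays.
cudaError_t tripleProduct(const float* hT, const float* hM, float* hD, int N)
{
    char*  base  = nullptr;
    size_t pitch = 0;                              // row stride in bytes
    const size_t rowBytes = N * sizeof(float);

    // 4*N rows: T, M, the temporary T*M, and the result D share one pitch.
    cudaError_t err = cudaMallocPitch((void**)&base, &pitch, rowBytes,
                                      4 * (size_t)N);
    if (err != cudaSuccess) return err;

    const size_t matBytes = (size_t)N * pitch;
    float* dT  = (float*)base;
    float* dM  = (float*)(base + matBytes);
    float* dTM = (float*)(base + 2 * matBytes);
    float* dD  = (float*)(base + 3 * matBytes);
    // Assumption: the byte pitch is a multiple of sizeof(float).
    const size_t ld = pitch / sizeof(float);

    cudaMemcpy2D(dT, pitch, hT, rowBytes, rowBytes, N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(dM, pitch, hM, rowBytes, rowBytes, N, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    // Both launches go to the default stream, so the second one starts only
    // after every block of the first one has finished writing dTM.
    matMulKernel<<<grid, block>>>(dT, dM, dTM, N, ld, false);  // TM = T * M
    matMulKernel<<<grid, block>>>(dTM, dT, dD, N, ld, true);   // D = TM * T'

    err = cudaMemcpy2D(hD, rowBytes, dD, pitch, rowBytes, N,
                       cudaMemcpyDeviceToHost);
    cudaFree(base);
    return err;
}
```

Splitting the work at the point where one stage needs the complete output of the previous stage is the usual way to get a grid-wide barrier without any extra machinery.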

Please check the value of the pitch that is allocated for a row. Note that cudaMallocPitch() returns the pitch in bytes, not in elements, so it cannot be used directly as an index into a float array. You also have to ignore the memory locations past the width of the matrix (the length of a row), so you need additional logic when accessing the values in the matrix. The way the values are accessed is the problem in your code. The pitch will most probably be the row size rounded up to some power-of-two alignment, so it is generally larger than the row itself. For example, if your width is 13, the row could be padded to 16.
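As a small illustration of that access logic, the sketch below fills a pitched buffer and copies back only the valid part of each row. The names fillPitched and fillAndFetch are made up for this example. The row address is computed in bytes first, using the pitch exactly as cudaMallocPitch() returns it, and only then indexed by column; the bounds check keeps every thread out of the padding.

```cuda
#include <cuda_runtime.h>

// Writes row*100 + col into the first `width` floats of each row.
// The padding between `width` and the pitch is never touched.
__global__ void fillPitched(float* base, size_t pitchBytes,
                            int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    // Step to the start of the row in bytes, then index by column in floats.
    float* rowPtr = (float*)((char*)base + row * pitchBytes);
    rowPtr[col] = row * 100.0f + col;
}

// Hypothetical helper: fills a width x height pitched device buffer, copies
// the valid columns into the dense host array hOut, and reports the pitch
// (in bytes) through pitchOut.
cudaError_t fillAndFetch(float* hOut, int width, int height, size_t* pitchOut)
{
    float* dBuf = nullptr;
    const size_t rowBytes = width * sizeof(float);

    cudaError_t err = cudaMallocPitch((void**)&dBuf, pitchOut, rowBytes,
                                      height);
    if (err != cudaSuccess) return err;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillPitched<<<grid, block>>>(dBuf, *pitchOut, width, height);

    // Source stride is the device pitch, destination stride is the dense row
    // size, so the padding is dropped during the copy.
    err = cudaMemcpy2D(hOut, rowBytes, dBuf, *pitchOut, rowBytes, height,
                       cudaMemcpyDeviceToHost);
    cudaFree(dBuf);
    return err;
}
```

Printing the value written to pitchOut for a width of 13 is a quick way to see how much padding a particular device adds per row.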