2D matrix addition question

Hello everyone,

I want to perform a simple addition of two 2D matrices, Agpu and Bgpu, each with 5 columns and 4 rows, and store the result in another matrix called Cgpu. I also want to exploit the GPU’s parallel execution, so I use 1 block with dimensions dim3 dimBlock(5,4). These are the 5 steps that I perform:

//1. GPU memory allocation for matrices Agpu, Bgpu and Cgpu
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Agpu,&Apitch,5*sizeof(float),4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Bgpu,&Bpitch,5*sizeof(float),4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Cgpu,&Cpitch,5*sizeof(float),4));

//2. Transfer data from host matrices A and B to device matrices Agpu and Bgpu
cudaMemcpy2D(Agpu,Apitch,A,5*sizeof(float),5*sizeof(float),4,cudaMemcpyHostToDevice);
cudaMemcpy2D(Bgpu,Bpitch,B,5*sizeof(float),5*sizeof(float),4,cudaMemcpyHostToDevice);

//3. Divide block to 5 columns and 4 rows
dim3 dimBlock (5,4);

//4. call the kernel
mat_add<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu);

//5. copy back the result from device Cgpu matrix to host C matrix
cudaMemcpy2D(C,5*sizeof(float),Cgpu,Cpitch,5*sizeof(float),4,cudaMemcpyDeviceToHost);

//kernel
__global__ void mat_add(float *A,float *B,float *C)
{
int i=threadIdx.x;
int j=threadIdx.y;
C[i+j*5]=A[i+j*5]+B[i+j*5];
}

However, only half of the C matrix elements are correct!! :wacko:
Another thing I noticed: if I print the pitch returned by cudaMallocPitch, it is 64. Since the pitch is the allocated width (in bytes) and I allocate 5*sizeof(float), shouldn’t it be 5*4 bytes = 20?

Can anyone suggest some advice?

Kind regards,
dtheodor

Assuming you are accessing your matrices in a column-major format, shouldn’t this line:

C[i+j*5]=A[i+j*5]+B[i+j*5];

be

C[i+j*4]=A[i+j*4]+B[i+j*4];

?

Here are the macros I am using to access matrices:

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) //row-major matrix

#define indexC(i, j, n_rows) (((j) * (n_rows)) + (i)) //column-major order addressing + 1st element has id #0
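
For example, a quick usage sketch (the 2x3 matrix below is just a made-up illustration, not taken from your code):

#include <stdio.h>

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) //row-major
#define indexC(i, j, n_rows) (((j) * (n_rows)) + (i)) //column-major

int main(void)
{
    //2 rows x 3 columns, stored row-major in a flat array
    float M[6] = { 0.0f, 1.0f, 2.0f,
                   3.0f, 4.0f, 5.0f };

    //row 1, column 2 -> prints 5.000000
    printf("M(1,2) = %f\n", M[indexR(1, 2, 3)]);

    return 0;
}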

Hello Panajev,

Thank you very much for your reply.

Actually I’m following a row major format, so I think this line should be ok.

This is how I print the data after they have been copied back from the GPU:

for (j=0;j<4;j++) //rows iterator
{
    for (i=0;i<5;i++) //columns iterator
        printf("\nC[%d]=%f",i+j*5,C[i+j*5]);
}

The funny thing is that if I increase the block dimensions (e.g. dim3 dimBlock(20,16)) the results are ok! :blink:

Since I am new to CUDA programming, I was trying to make sure that I have allocated the proper memory size and performed the cudaMemcpy2D correctly, but I cannot find anything wrong…

If you are following a row major format then you want (i * NUMBER_OF_COLUMNS + j ) and not (j * NUMBER_OF_COLUMNS + i), but that might not be the problem.

… wait, you are using i for the columns and j for the rows? Ok, so that is not the problem…

but… you assign i and j this way…

int i=threadIdx.x;

int j=threadIdx.y;

could you try to set them this way instead:

int j=threadIdx.x;

int i=threadIdx.y;

I set those two variables in the CUDA kernel the way you DID set them, but I use i as row_id and j as column_id while in your code i is for columns and j is for rows.

I’ll try to implement and test the code over here to play around with it and see what’s wrong (is there anything else needed besides what you posted here?).

Hello Panajev,

I think I’ve put you to some trouble! :D Just kidding of course!

I assign i and j the way you saw because I thought that threadIdx.x indexes along the width (i.e. columns) and threadIdx.y along the height (i.e. rows). Unfortunately I had to leave my office where all my files are, so I cannot try your suggestion now. But I will do that tomorrow.

I don’t think you need anything else except to declare the variables. I wish I had the files with me now to post them here directly.

Btw, I also tried to do the same calculation, but this time using cudaMalloc and cudaMemcpy. In other words, I treated my 2D array as a 1D linear one, performed the same steps as before, and everything worked perfectly! However, it would be really convenient to know why the program does not work with cudaMallocPitch and cudaMemcpy2D.
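
Roughly from memory (since I don’t have the exact files with me), the flat version looked something like this:

//flat 1D version: no row padding, so the C[i+j*5] indexing in the kernel is fine
size_t bytes = 5*4*sizeof(float);

CUDA_SAFE_CALL(cudaMalloc((void**)&Agpu,bytes));
CUDA_SAFE_CALL(cudaMalloc((void**)&Bgpu,bytes));
CUDA_SAFE_CALL(cudaMalloc((void**)&Cgpu,bytes));

cudaMemcpy(Agpu,A,bytes,cudaMemcpyHostToDevice);
cudaMemcpy(Bgpu,B,bytes,cudaMemcpyHostToDevice);

dim3 dimBlock(5,4);
mat_add<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu);

cudaMemcpy(C,Cgpu,bytes,cudaMemcpyDeviceToHost);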

If you run the code, could you please let me know if the results are ok?

Thank you again very much!

Hi dtheodor,

It will be much simpler if you use 2D indexing within the kernel.

I’m attaching very simple matrix addition source code. You can study it and modify it as required.

~Sibi


MatrixAddition.zip (885 Bytes)

Hello Sibi A,

Thank you very much for your attached code. I see that you also avoid using cudaMallocPitch and cudaMemcpy2D to do the 2D matrix addition. :) I did the same and now everything works fine. However, it would be nice to know how to use these 2D functions. If anyone has a solid example of these functions, it would be nice to post it for newbies like me! :)

Kind regards,

dtheodor

I think the problem is that you have to use the pitch in your indexing. Each row is padded out to the pitch (that is also why you see 64 instead of 20: the row width is rounded up to an alignment boundary), so indexing with C[i+j*5] no longer points at the right elements.
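
For example, a minimal sketch (not your exact code) of how the kernel could use the pitch, following the pattern from the cudaMallocPitch documentation. The pitch is returned in bytes, so you step from row to row through a char* cast:

//kernel that respects the pitch (pitch is in bytes)
__global__ void mat_add_pitched(float *A, float *B, float *C, size_t pitch)
{
    int i = threadIdx.x; //column
    int j = threadIdx.y; //row

    //row j starts j*pitch bytes after the base pointer
    float *rowA = (float*)((char*)A + j*pitch);
    float *rowB = (float*)((char*)B + j*pitch);
    float *rowC = (float*)((char*)C + j*pitch);

    rowC[i] = rowA[i] + rowB[i];
}

//host side: the allocations and cudaMemcpy2D calls stay as they are,
//only the kernel launch changes (Apitch is passed here on the assumption
//that all three pitches come back equal, which they should for identical dimensions)
mat_add_pitched<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu,Apitch);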