2D matrix addition question

Hello everyone,

I want to perform a simple addition of two 2D matrices, Agpu and Bgpu, each with 5 columns and 4 rows, and store the result in another matrix called Cgpu. I also want to exploit the GPU’s parallel execution, so I use 1 block with dimensions dim3 dimBlock(5,4). These are the 5 steps that I perform:

//1. GPU memory allocation for matrices Agpu, Bgpu and Cgpu
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Agpu,&Apitch,5*sizeof(float),4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Bgpu,&Bpitch,5*sizeof(float),4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Cgpu,&Cpitch,5*sizeof(float),4));

//2. Transfer data from host matrices A and B to device matrices Agpu and Bgpu
cudaMemcpy2D(Agpu,Apitch,A,5*sizeof(float),5*sizeof(float),4,cudaMemcpyHostToDevice);
cudaMemcpy2D(Bgpu,Bpitch,B,5*sizeof(float),5*sizeof(float),4,cudaMemcpyHostToDevice);

//3. Divide block to 5 columns and 4 rows
dim3 dimBlock (5,4);

//4. call the kernel
mat_add<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu);

//5. copy back the result from device Cgpu matrix to host C matrix
cudaMemcpy2D(C,5*sizeof(float),Cgpu,Cpitch,5*sizeof(float),4,cudaMemcpyDeviceToHost);

//kernel
__global__ void mat_add(float *A,float *B,float *C)
{
int i=threadIdx.x;
int j=threadIdx.y;
C[i+j*5]=A[i+j*5]+B[i+j*5];
}

However, only half of the C matrix elements are correct!! :wacko:
Another thing I noticed: if I print the pitch returned by cudaMallocPitch, it is 64. Since the pitch is the allocated width (in bytes) and I allocate 5*sizeof(float), shouldn’t it be 5*4 bytes = 20?

Can anyone suggest some advice?

Kind regards,
dtheodor

Assuming you are accessing your matrices in a column-major format, shouldn’t this line:

C[i+j*5]=A[i+j*5]+B[i+j*5];

be

C[i+j*4]=A[i+j*4]+B[i+j*4];

?

Here are the macros I am using to access matrices:

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) //row-major matrix

#define indexC(i, j, n_rows) (((j) * (n_rows)) + (i)) //column-major order addressing + 1st element has id #0
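
For example, a quick usage sketch (the 2x3 matrix below is just a made-up illustration, not taken from your code):

#include <stdio.h>

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) //row-major
#define indexC(i, j, n_rows) (((j) * (n_rows)) + (i)) //column-major

int main(void)
{
    //2 rows x 3 columns, stored row-major in a flat array
    float M[6] = { 0.0f, 1.0f, 2.0f,
                   3.0f, 4.0f, 5.0f };

    //row 1, column 2 -> prints 5.000000
    printf("M(1,2) = %f\n", M[indexR(1, 2, 3)]);

    return 0;
}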

Hello Panajev,

Thank you very much for your reply.

Actually I’m following a row major format, so I think this line should be ok.

This is how I print the data after they have been copied back from the GPU:

for (j=0;j<4;j++) //rows iterator
{
    for (i=0;i<5;i++) //columns iterator
        printf("\nC[%d]=%f",i+j*5,C[i+j*5]);
}

The funny thing is that if I increase the block dimensions (e.g. dim3 dimBlock(20,16)) the results are ok! :blink:

Since I am new to CUDA programming, I was trying to make sure that I have allocated the proper memory size and performed the cudaMemcpy2D correctly, but I cannot find anything wrong…

If you are following a row major format then you want (i * NUMBER_OF_COLUMNS + j ) and not (j * NUMBER_OF_COLUMNS + i), but that might not be the problem.

… wait, you are using i for the columns and j for the rows? Ok, so that is not the problem…

but… you assign i and j this way…

int i=threadIdx.x;

int j=threadIdx.y;

could you try to set them this way instead:

int j=threadIdx.x;

int i=threadIdx.y;

I set those two variables in the CUDA kernel the way you DID set them, but I use i as row_id and j as column_id while in your code i is for columns and j is for rows.

I’ll try to implement and test the code over here to play around with it and see what’s wrong (is there anything else needed besides what you posted here?).

Hello Panajev,

I think I’ve put you to some trouble! :D Just kidding of course!

I assign i and j the way you saw because I thought that threadIdx.x indexes along the width (i.e. columns) and threadIdx.y along the height (i.e. rows). Unfortunately I had to leave my office where all my files are, so I cannot try your suggestion now. But I will do that tomorrow.

I don’t think you need anything else except to declare the variables. I wish I had the files with me now to post them here directly.

Btw, I also tried to do the same calculation, but this time using cudaMalloc and cudaMemcpy. In other words, I treated my 2D array as a 1D linear one, performed the same steps as before, and everything worked perfectly! However, it would be really convenient to know why the program does not work with cudaMallocPitch and cudaMemcpy2D.
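
Roughly from memory (since I don’t have the exact files with me), the flat version looked something like this:

//flat 1D version: no row padding, so the C[i+j*5] indexing in the kernel is fine
size_t bytes = 5*4*sizeof(float);

CUDA_SAFE_CALL(cudaMalloc((void**)&Agpu,bytes));
CUDA_SAFE_CALL(cudaMalloc((void**)&Bgpu,bytes));
CUDA_SAFE_CALL(cudaMalloc((void**)&Cgpu,bytes));

cudaMemcpy(Agpu,A,bytes,cudaMemcpyHostToDevice);
cudaMemcpy(Bgpu,B,bytes,cudaMemcpyHostToDevice);

dim3 dimBlock(5,4);
mat_add<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu);

cudaMemcpy(C,Cgpu,bytes,cudaMemcpyDeviceToHost);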

If you run the code, could you please let me know if the results are ok?

Thank you again very much!

Hi dtheodor,

It will be much simpler if you use 2D indexing within the kernel.

I’m attaching very simple matrix addition source code. You can study it and modify it as required.

~Sibi


MatrixAddition.zip (885 Bytes)

Hello Sibi A,

Thank you very much for your attached code. I see that you also avoid using cudaMallocPitch and cudaMemcpy2D to do the 2D matrix addition. :) I did the same and now everything works fine. However, it would be nice to know how to use these 2D functions. If anyone has a solid example of these functions, it would be nice to post it for newbies like me! :)

Kind regards,

dtheodor

I think the problem is that you have to use the pitch in your indexing. Each row is padded out to the pitch (that is also why you see 64 instead of 20: the row width is rounded up to an alignment boundary), so indexing with C[i+j*5] no longer points at the right elements.
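
For example, a minimal sketch (not your exact code) of how the kernel could use the pitch, following the pattern from the cudaMallocPitch documentation. The pitch is returned in bytes, so you step from row to row through a char* cast:

//kernel that respects the pitch (pitch is in bytes)
__global__ void mat_add_pitched(float *A, float *B, float *C, size_t pitch)
{
    int i = threadIdx.x; //column
    int j = threadIdx.y; //row

    //row j starts j*pitch bytes after the base pointer
    float *rowA = (float*)((char*)A + j*pitch);
    float *rowB = (float*)((char*)B + j*pitch);
    float *rowC = (float*)((char*)C + j*pitch);

    rowC[i] = rowA[i] + rowB[i];
}

//host side: the allocations and cudaMemcpy2D calls stay as they are,
//only the kernel launch changes (Apitch is passed here on the assumption
//that all three pitches come back equal, which they should for identical dimensions)
mat_add_pitched<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu,Apitch);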