Hello everyone,

I want to perform a simple addition of two 2D matrices Agpu and Bgpu, each with 5 columns and 4 rows, and store the result in a third matrix called Cgpu. I also want to exploit the GPU’s parallel execution, so I use a single block with dimensions dim3 dimBlock(5,4). These are the 5 steps that I perform:

//1. GPU memory allocation for matrices Agpu, Bgpu and Cgpu

CUDA_SAFE_CALL(cudaMallocPitch((void**)&Agpu,&Apitch,5*sizeof(float),4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Bgpu,&Bpitch,5*sizeof(float),4));

CUDA_SAFE_CALL(cudaMallocPitch((void**)&Cgpu,&Cpitch,5*sizeof(float),4));
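
For completeness, the declarations behind these calls look roughly like this (assuming the host matrices are plain row-major 4x5 float arrays):

float A[4][5],B[4][5],C[4][5];   // host matrices, 4 rows x 5 columns, row-major
float *Agpu,*Bgpu,*Cgpu;         // device pointers filled in by cudaMallocPitch
size_t Apitch,Bpitch,Cpitch;     // pitch (in bytes) of each device allocation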

//2. Transfer data from host matrices A and B to device matrices Agpu and Bgpu

cudaMemcpy2D(Agpu,Apitch,A,5*sizeof(float),5*sizeof(float),4,cudaMemcpyHostToDevice);

cudaMemcpy2D(Bgpu,Bpitch,B,5*sizeof(float),5*sizeof(float),4,cudaMemcpyHostToDevice);

//3. Divide the block into 5 columns and 4 rows (one thread per matrix element)

dim3 dimBlock(5,4);

//4. Call the kernel

mat_add<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu);

//5. Copy the result back from the device matrix Cgpu to the host matrix C

cudaMemcpy2D(C,5*sizeof(float),Cgpu,Cpitch,5*sizeof(float),4,cudaMemcpyDeviceToHost);

//kernel

__global__ void mat_add(float *A,float *B,float *C)

{

int i=threadIdx.x;   // column index, 0..4

int j=threadIdx.y;   // row index, 0..3

C[i+j*5]=A[i+j*5]+B[i+j*5];   // row-major indexing with a row width of 5 floats

}

However, only half of the C matrix elements are correct!! :wacko:
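
To check the result I compare against a plain CPU addition on the host, roughly like this (again assuming row-major 4x5 arrays):

for (int j=0;j<4;j++)
    for (int i=0;i<5;i++)
        if (C[j][i] != A[j][i]+B[j][i])
            printf("mismatch at row %d, col %d\n",j,i);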

Also, another thing I noticed is that if I print the pitch returned by cudaMallocPitch, it is 64. Since the pitch is the allocated width (in bytes), and I request 5*sizeof(float), shouldn’t it be 5*4 bytes = 20?
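
This is roughly how I print it:

printf("requested width = %u bytes, pitch = %u bytes\n",(unsigned)(5*sizeof(float)),(unsigned)Apitch);   // prints 20 and 64 here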

Can anyone offer some advice?

Kind regards,

dtheodor