Hello everyone,
I want to perform a simple addition of two 2D matrices, Agpu and Bgpu, each with 5 columns and 4 rows, and store the result in a third matrix called Cgpu. I also want to exploit the GPU’s parallel execution, so I use a single block with dimensions dim3 dimBlock(5,4). These are the 5 steps that I perform:
//1. GPU memory allocation for matrices Agpu, Bgpu and Cgpu
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Agpu, &Apitch, 5*sizeof(float), 4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Bgpu, &Bpitch, 5*sizeof(float), 4));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&Cgpu, &Cpitch, 5*sizeof(float), 4));
//2. Transfer data from host matrices A and B to device matrices Agpu and Bgpu
cudaMemcpy2D(Agpu, Apitch, A, 5*sizeof(float), 5*sizeof(float), 4,
             cudaMemcpyHostToDevice);
cudaMemcpy2D(Bgpu, Bpitch, B, 5*sizeof(float), 5*sizeof(float), 4,
             cudaMemcpyHostToDevice);
//3. Lay out the block as 5 columns and 4 rows of threads
dim3 dimBlock (5,4);
//4. call the kernel
mat_add<<<1,dimBlock>>>(Agpu,Bgpu,Cgpu);
//5. copy back the result from device Cgpu matrix to host C matrix
cudaMemcpy2D(C, 5*sizeof(float), Cgpu, Cpitch, 5*sizeof(float), 4,
             cudaMemcpyDeviceToHost);
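To make sure I understand what cudaMemcpy2D is doing in steps 2 and 5, here is a plain-C sketch of its copy semantics as I understand them (my approximation, not the actual implementation): it copies `width` bytes of each row, stepping the destination and source pointers by their respective pitches.

```c
#include <stddef.h>
#include <string.h>

/* Host-side sketch of cudaMemcpy2D's copy semantics (an approximation):
   copy `width` bytes of each of `height` rows, advancing by `dpitch`
   bytes per row on the destination side and `spitch` on the source side. */
void memcpy2d_sketch(void *dst, size_t dpitch,
                     const void *src, size_t spitch,
                     size_t width, size_t height)
{
    for (size_t r = 0; r < height; r++)
        memcpy((char *)dst + r * dpitch,
               (const char *)src + r * spitch,
               width);
}
```

So a tightly packed host matrix (spitch = 5*sizeof(float) = 20 bytes) ends up in device memory with padding between rows whenever dpitch > spitch.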
//kernel
__global__ void mat_add(float *A, float *B, float *C)
{
int i = threadIdx.x; // column
int j = threadIdx.y; // row
C[i + j*5] = A[i + j*5] + B[i + j*5];
}
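I suspect the pitch has to enter the indexing somehow. The addressing pattern I have seen for pitched allocations steps by the pitch *in bytes* per row rather than by the logical row width; here is a host-side sketch of that pattern over plain memory (the 64-byte pitch is an assumption matching what cudaMallocPitch returns to me):

```c
#include <stddef.h>

/* Return a pointer to element (row j, column i) of a pitched buffer,
   where `pitch` is the allocated row width in BYTES — it may exceed
   cols * sizeof(float) because of alignment padding. */
static float *pitched_elem(void *base, size_t pitch, int j, int i)
{
    return (float *)((char *)base + j * pitch) + i;
}

/* Host-side add over 5x4 matrices stored with a given pitch,
   mirroring what each (i, j) thread would compute. */
void mat_add_pitched(void *A, void *B, void *C, size_t pitch)
{
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 5; i++)
            *pitched_elem(C, pitch, j, i) =
                *pitched_elem(A, pitch, j, i) +
                *pitched_elem(B, pitch, j, i);
}
```

With a 64-byte pitch, `C[i + j*5]` and `pitched_elem(C, 64, j, i)` agree only on row 0, which might be related to my half-wrong results.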
However, only half of the elements of the C matrix come back correct!! :wacko:
Another thing I noticed: if I print the pitch returned by cudaMallocPitch, it is 64. Since the pitch is the allocated width (in bytes), and I request 5*sizeof(float), shouldn’t it be 5*4 bytes = 20?
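If I had to guess, the requested row width gets rounded up to some alignment boundary; the exact alignment would be device-dependent, and the 64 bytes below is just an assumption matching the pitch I see printed:

```c
#include <stddef.h>

/* Sketch: round a requested row width (in bytes) up to the next
   multiple of a hardware alignment. The 64-byte alignment is an
   assumption on my part, not something I found documented. */
size_t round_up_pitch(size_t width_bytes, size_t align)
{
    return (width_bytes + align - 1) / align * align;
}
```

That would explain why asking for 20 bytes per row yields a 64-byte pitch.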
Can anyone offer some advice?
Kind regards,
dtheodor