I want to perform a simple addition of two 2D matrices, Agpu and Bgpu, each with 5 columns and 4 rows, and store the result in another matrix called Cgpu. I also want to exploit the GPU's parallel execution, so I use a single block with dimensions dim3 dimBlock(5, 4). These are the 5 steps that I perform:
//1. GPU memory allocation for matrices Agpu, Bgpu and Cgpu
//2. Transfer data from host matrices A, B, C to device matrices Agpu, Bgpu, Cgpu
//3. Divide the block into 5 columns and 4 rows
dim3 dimBlock (5,4);
//4. Call the kernel
//5. Copy the result back from the device Cgpu matrix to the host C matrix
__global__ void mat_add(float *A, float *B, float *C)
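For reference, here is how I picture the five steps end to end. This is only a sketch of my setup (the matrix names and the 5x4 shape are as above; the pitch handling in the kernel and the cudaMemcpy2D calls are how I understand they are supposed to work, so corrections welcome):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define COLS 5
#define ROWS 4

// A pitched row r starts at (char *)base + r * pitch, where pitch is in bytes.
__global__ void mat_add(float *A, float *B, float *C, size_t pitch)
{
    int col = threadIdx.x;  // 0..4
    int row = threadIdx.y;  // 0..3
    float *rowA = (float *)((char *)A + row * pitch);
    float *rowB = (float *)((char *)B + row * pitch);
    float *rowC = (float *)((char *)C + row * pitch);
    rowC[col] = rowA[col] + rowB[col];
}

int main(void)
{
    float A[ROWS][COLS], B[ROWS][COLS], C[ROWS][COLS];
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c) { A[r][c] = r; B[r][c] = c; }

    float *Agpu, *Bgpu, *Cgpu;
    size_t pitch;
    // 1. GPU memory allocation (identical requests yield the same pitch)
    cudaMallocPitch((void **)&Agpu, &pitch, COLS * sizeof(float), ROWS);
    cudaMallocPitch((void **)&Bgpu, &pitch, COLS * sizeof(float), ROWS);
    cudaMallocPitch((void **)&Cgpu, &pitch, COLS * sizeof(float), ROWS);

    // 2. Host -> device transfers: destination rows are `pitch` bytes apart,
    //    source rows are the compact COLS * sizeof(float) bytes apart
    cudaMemcpy2D(Agpu, pitch, A, COLS * sizeof(float),
                 COLS * sizeof(float), ROWS, cudaMemcpyHostToDevice);
    cudaMemcpy2D(Bgpu, pitch, B, COLS * sizeof(float),
                 COLS * sizeof(float), ROWS, cudaMemcpyHostToDevice);

    // 3.-4. One 5x4 block, one thread per matrix element
    dim3 dimBlock(COLS, ROWS);
    mat_add<<<1, dimBlock>>>(Agpu, Bgpu, Cgpu, pitch);

    // 5. Device -> host copy of the result
    cudaMemcpy2D(C, COLS * sizeof(float), Cgpu, pitch,
                 COLS * sizeof(float), ROWS, cudaMemcpyDeviceToHost);
    printf("C[3][4] = %f\n", C[3][4]);  // expecting 3 + 4 = 7

    cudaFree(Agpu);
    cudaFree(Bgpu);
    cudaFree(Cgpu);
    return 0;
}
```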
However, only half of the C matrix elements are correct!
Another thing I noticed is that the pitch returned by cudaMallocPitch is 64. Since pitch is the allocated width (in bytes), and I request 5*sizeof(float), shouldn't it be 5*4 bytes = 20?
Can anyone offer some advice?