I am a new user and have a problem with my code. It works in emulation mode and gives the correct result, but not in real execution on the GPU. I am testing matrix multiplication for very small sizes (admittedly not a good fit for a GPU).

The code is specialized for 4x4 matrix multiplication (using IKJ loop order) and is launched with a (2,2) grid of blocks and (2,4) threads within each block.

Instead of iterating I = 1 to 4, K = 1 to 4, J = 1 to 4, I am using I = 1 to 4, J = 1 to 8, K = 1 to 2, where I maps to the blocks, J to the threads within a block, and K to the loop inside the kernel.
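To double-check this mapping, here is a small host-side sketch in plain C++ (no GPU needed). It assumes 8 threads per block and the same `j % 4` and `2*k + j/4` index formulas that the kernel uses for `jj` and `kk`; the names `column_of`, `inner_index_of`, and `count_visits` are made up for this illustration.

```cpp
#include <array>

// Host-side sketch of the index mapping for one row i of C.
// Assumptions: 4x4 matrices, 8 threads per block (j = 0..7), 2 loop iterations (k = 0..1).
constexpr int SIZE = 4;            // matrix dimension
constexpr int THREADS = 8;         // blockDim.x * blockDim.y = 2 * 4
constexpr int K_ITERS = SIZE / 2;  // loop iterations inside the kernel

// Column of C (and B) that thread j works on.
int column_of(int j) { return j % SIZE; }

// Inner-product index that thread j uses in loop iteration k.
int inner_index_of(int j, int k) { return 2 * k + j / SIZE; }

// Counts how often each (column, inner index) pair is visited.
std::array<std::array<int, SIZE>, SIZE> count_visits() {
    std::array<std::array<int, SIZE>, SIZE> visits{};  // zero-initialized
    for (int k = 0; k < K_ITERS; ++k)
        for (int j = 0; j < THREADS; ++j)
            ++visits[column_of(j)][inner_index_of(j, k)];
    return visits;
}
```

Every entry returned by `count_visits()` should be 1, i.e. each (column, inner index) pair is handled exactly once per row. Note that threads `j` and `j + 4` map to the same column.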

The element indices used in the multiplication are correct, and emulation mode produces the correct result. In real execution, however, the result is wrong.

Can anybody point out whether the error lies in the synchronization or in the memory access (the writes to C)?

// Code for matrix multiply (IKJ), valid for size 4

// dim3 grid(2,2); dim3 block(2,4); <<<grid, block>>>(…)

#define TYPE float

__global__ void sgemm( int wA, int wB, int wC, TYPE alpha, const TYPE *A, int lda, const TYPE *B, int ldb, TYPE beta, TYPE *C, int ldc)

{

//Dimensions of all the matrices 4*4, wA*lda, wB*ldb, wC*ldc

int bx = blockIdx.x;

int by = blockIdx.y;

int tx = threadIdx.x;

int ty = threadIdx.y;

int SIZEN = wA; int SIZEK = wB; int SIZEM = wC;

int i = bx * gridDim.y + by; // get block#

int j = (tx * blockDim.y + ty) ; // get thread#

int kk = (tx * blockDim.y + ty);

int SIZEN2 = SIZEN >> 1; // size for k-loop, i.e. 2 instead of 4 for a matrix with dimension 4x4

int k =0;

// jj and kk will be used for adjusting indices due to change in normal iteration count as mentioned above

int jj = j % SIZEK;

do

{

int kk = 2 * k + j/SIZEK;

C[i*SIZEK+jj] += A[i*SIZEN+kk] * B[kk*SIZEK+jj];

__syncthreads();

k++;

} while(k < SIZEN2);

}
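
One possible explanation, stated as an assumption rather than a confirmed diagnosis: threads `j` and `j + 4` update the same `C[i*SIZEK+jj]` in the same iteration, and `+=` is a non-atomic read-modify-write, so one update can be lost when the threads really run concurrently. The `__syncthreads()` after the statement would not prevent that. The host-side C++ sketch below models one row of C in two ways: threads running one after another (roughly how I assume emulation behaves), and threads running in lockstep (all read, then all write). The function names are hypothetical, and the lockstep model is a simplification of real hardware.

```cpp
#include <array>

constexpr int N = 4;            // matrix dimension
constexpr int NUM_THREADS = 8;  // threads per block (2 x 4)
using Row = std::array<float, N>;
using Mat = std::array<Row, N>;

// Sequential model: threads run one after another,
// so every += sees the result of the previous one.
Row row_sequential(const Row& a, const Mat& b) {
    Row c{};  // starts at zero
    for (int k = 0; k < N / 2; ++k)
        for (int j = 0; j < NUM_THREADS; ++j) {
            int jj = j % N, kk = 2 * k + j / N;
            c[jj] += a[kk] * b[kk][jj];
        }
    return c;
}

// Lockstep model: all threads read C, then all threads write.
// Threads j and j + 4 target the same c[jj], so the later write
// overwrites the earlier one and one contribution is lost.
Row row_lockstep(const Row& a, const Mat& b) {
    Row c{};  // starts at zero
    for (int k = 0; k < N / 2; ++k) {
        std::array<float, NUM_THREADS> updated{};
        for (int j = 0; j < NUM_THREADS; ++j) {  // phase 1: every thread reads the old value
            int jj = j % N, kk = 2 * k + j / N;
            updated[j] = c[jj] + a[kk] * b[kk][jj];
        }
        for (int j = 0; j < NUM_THREADS; ++j)    // phase 2: every thread writes back
            c[j % N] = updated[j];
    }
    return c;
}
```

With all inputs set to 1, the sequential model gives 4 per element while the lockstep model gives 2, which is the kind of mismatch a lost update would cause. If this is indeed the cause, letting a single thread accumulate each element of C in a local variable and write it once should avoid it.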