Problem with memory access or thread synchronization


I am a new user and have a problem with my code. The code works well in emulation mode and gives the correct result, but not in real execution. I am testing matrix multiplication for very small sizes (though this is not a very good idea on a GPU).

The code is specialized for matrix multiplication (using IKJ ordering) with size 4x4, and is called with (2,2) blocks and (2,4) threads within each block.
Instead of using I = 1 to 4, K = 1 to 4, J = 1 to 4, I am using I = 1 to 4, J = 1 to 8, K = 1 to 2, with I mapped to the blocks, K to the loop in the code, and J to the threads.
The indices of the matrix elements are correct during the multiplication, and the correct result is produced in emulation mode. However, real execution does not produce the correct result.

Can anybody point out the error in synchronization or memory access (writing)?




// Code for matrix multiply (IKJ), valid for size 4
// dim3 grid(2,2); dim3 block(2,4); <<<grid, block>>>(…)

#define TYPE float

__global__ void sgemm( int wA, int wB, int wC, TYPE alpha, const TYPE *A, int lda, const TYPE *B, int ldb, TYPE beta, TYPE *C, int ldc)
{
    // Dimensions of all the matrices: 4*4, wA*lda, wB*ldb, wC*ldc

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int SIZEN = wA; int SIZEK = wB; int SIZEM = wC;

    int i = bx * gridDim.y + by; // get block#

    int j = (tx * blockDim.y + ty) ; // get thread#
    int kk = (tx * blockDim.y + ty);

    int SIZEN2 = SIZEN >> 1; // size for k-loop, i.e. 2 instead of 4 for a matrix with dimension 4x4
    int k = 0;
    // jj and kk will be used for adjusting indices due to change in normal iteration count as mentioned above

    int jj = j % SIZEK;

    do {
        int kk = 2 * k + j/SIZEK;
        C[i*SIZEK+jj] += A[i*SIZEN+kk] * B[kk*SIZEK+jj];
        k++;
    } while(k < SIZEN2);
}

I’m not sure if that is the case, but try passing the -Xopencc -O0 flag to the nvcc compiler (in Visual Studio, add this to the command-line arguments in the CUDA Build Rule).

Even with the option -Xopencc -O0, the result is the same as before, i.e. incorrect.

In the loop, the following indices are produced correctly while calculating the first element of the result matrix C, but the output is still not correct.

i=0, k=0, j=0 :: C[0] += A[0] * B[0]

i=0, k=0, j=4 :: C[0] += A[1] * B[4]

i=0, k=1, j=0 :: C[0] += A[2] * B[8]

i=0, k=1, j=4 :: C[0] += A[3] * B[12]

… trace for other elements
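The trace above can be reproduced on the host by mirroring the kernel's index arithmetic. The following is a minimal sketch in plain host code that compiles as part of a .cu file; the helper names (a_index, b_index, c_index, writers_of, print_trace) are hypothetical and not part of the posted code. It also counts how many threads of one block update the same element of C.

```cuda
#include <cstdio>

// Host-side mirror of the kernel's index arithmetic (helper names are hypothetical).
// i: block number, j: thread number within the block, k: loop iteration.
int a_index(int i, int j, int k, int sizeN, int sizeK) { return i * sizeN + (2 * k + j / sizeK); }
int b_index(int j, int k, int sizeK) { return (2 * k + j / sizeK) * sizeK + (j % sizeK); }
int c_index(int i, int j, int sizeK) { return i * sizeK + (j % sizeK); }

// Number of threads in block i whose updates land on C[target].
int writers_of(int i, int target, int threadsPerBlock, int sizeK)
{
    int count = 0;
    for (int j = 0; j < threadsPerBlock; ++j)
        if (c_index(i, j, sizeK) == target) ++count;
    return count;
}

// Prints every update applied to C[target] by block i, in the format of the trace.
void print_trace(int i, int target, int threadsPerBlock, int sizeN, int sizeK)
{
    for (int k = 0; k < sizeN / 2; ++k)
        for (int j = 0; j < threadsPerBlock; ++j)
            if (c_index(i, j, sizeK) == target)
                printf("i=%d, k=%d, j=%d :: C[%d] += A[%d] * B[%d]\n",
                       i, k, j, target, a_index(i, j, k, sizeN, sizeK), b_index(j, k, sizeK));
}
```

Calling print_trace(0, 0, 8, 4, 4) reproduces the four lines listed for C[0], and writers_of(0, 0, 8, 4) reports that two threads (j=0 and j=4) update that same element.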

The thread with id 4 (j=4) calculates the sum of the last 2 elements, and this is also the result. The elements calculated by the thread with id 0 (j=0) are not added to the result. What am I missing? Is the block/grid configuration fine?
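One likely explanation for this symptom is a race on C: threads j and j+4 of the same block both perform an unsynchronized read-modify-write (+=) on the same element. On real hardware both threads can read the old value before either writes, so only one update survives. Emulation mode runs the threads one after another, which hides the problem. The following is a minimal sketch of one way to avoid the race, assuming the same 4x4 size and the same (2,2) grid and (2,4) block configuration. Each thread accumulates its partial sum in a register, the partial sums are exchanged through shared memory, and only one thread per column writes to C. The names sgemm_partial and multiply4x4 are hypothetical, and the alpha, beta and lda/ldb/ldc parameters of the posted kernel are left out for brevity.

```cuda
#include <cuda_runtime.h>

#define TYPE float
#define N 4   // matrix dimension; this sketch assumes 4x4 matrices as in the post

// Same mapping as the posted kernel: block -> row i, thread j -> column jj.
// Threads j and j + N handle different k values for the same C element.
__global__ void sgemm_partial(const TYPE *A, const TYPE *B, TYPE *C)
{
    __shared__ TYPE partial[2 * N];                   // one slot per thread of the block

    int i    = blockIdx.x * gridDim.y + blockIdx.y;   // block number = row of C
    int j    = threadIdx.x * blockDim.y + threadIdx.y; // thread number, 0..7
    int jj   = j % N;                                  // column of C
    int half = j / N;                                  // 0 or 1: which k values this thread takes

    TYPE sum = 0;                                      // private accumulator, no sharing
    for (int k = 0; k < N / 2; ++k) {
        int kk = 2 * k + half;
        sum += A[i * N + kk] * B[kk * N + jj];
    }
    partial[j] = sum;
    __syncthreads();                                   // all partial sums are now visible

    if (half == 0)                                     // exactly one writer per C element
        C[i * N + jj] += partial[jj] + partial[jj + N];
}

// Hypothetical host wrapper: copies the 4x4 inputs to the device, launches the
// kernel with the configuration from the post, and copies C back.
// Returns false if any CUDA call fails.
bool multiply4x4(const TYPE *hA, const TYPE *hB, TYPE *hC)
{
    const size_t bytes = N * N * sizeof(TYPE);
    TYPE *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 grid(2, 2), block(2, 4);
        sgemm_partial<<<grid, block>>>(dA, dB, dC);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}
```

Because the kernel still adds into C, the host array passed as hC should be zero-initialized before the call. An alternative is to keep the original structure and replace the += with atomicAdd, but atomicAdd on float requires a device of compute capability 2.0 or higher.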