Problem with memory access or thread synchronization

zaxwan · November 13, 2009, 11:39am

Hi,

I am a new user and have a problem with my code. It works well in emulation mode and gives the correct result, but not in real execution on the device. I am testing matrix multiplication for very small sizes (though that is not a very good idea on a GPU).

The code is specialized for matrix multiplication (in IKJ order) with size 4x4, and is launched with a (2,2) grid of blocks and (2,4) threads within each block.
Instead of using I = 1 to 4, K = 1 to 4, J = 1 to 4, I am using I = 1 to 4, J = 1 to 8, K = 1 to 2, with I mapped to the blocks, K to the loop in the kernel, and J to the threads.
The indices of the matrix elements are correct during the multiplication, and the correct result is produced in emulation mode. However, real execution does not produce the correct result.

Can anybody point out whether the error is in the synchronization or in the memory access (the writes)?

Thanks

Regards
ZaxWan

///-------------------------

// Code for matrix multiply (IKJ order), valid only for size 4
// dim3 grid(2,2); dim3 block(2,4); <<<grid, block>>>(…)

#define TYPE float

__global__ void sgemm( int wA, int wB, int wC, TYPE alpha, const TYPE *A, int lda, const TYPE *B, int ldb, TYPE beta, TYPE *C, int ldc)
{
// Dimensions of all the matrices: 4*4, wA*lda, wB*ldb, wC*ldc

int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int SIZEN = wA; int SIZEK = wB; int SIZEM = wC;

int i = bx * gridDim.y + by; // get block#

int j = (tx * blockDim.y + ty) ; // get thread#
int kk = (tx * blockDim.y + ty);

int SIZEN2 = SIZEN >> 1; // size for k-loop, i.e. 2 instead of 4 for a matrix with dimension 4x4
int k =0;
// jj and kk will be used for adjusting indices due to change in normal iteration count as mentioned above

int jj = j % SIZEK;

do
{

int kk = 2 * k + j/SIZEK;
C[i*SIZEK+jj] += A[i*SIZEN+kk] * B[kk*SIZEK+jj];

__syncthreads();
k++;
} while(k < SIZEN2);
}

tonhead · November 13, 2009, 11:54am

I’m not sure this is the cause, but try passing the -Xopencc -O0 flag to the nvcc compiler (in Visual Studio, add it to the command-line arguments in the CUDA Build Rule).

zaxwan · November 13, 2009, 2:39pm

Even with the -Xopencc -O0 option, the result is the same as before, i.e. incorrect.

In the loop, the following indices are produced correctly while calculating the first element of the result matrix C, but the output is still not correct.

i=0, k=0, j=0 :: C[0] += A[0] * B[0]

i=0, k=0, j=4 :: C[0] += A[1] * B[4]

i=0, k=1, j=0 :: C[0] += A[2] * B[8]

i=0, k=1, j=4 :: C[0] += A[3] * B[12]

… trace for other elements

The thread with id 4 (j=4) calculates the sum of the last two products, and that sum alone ends up as the result. The products calculated by the thread with id 0 (j=0) are not added to it. What am I missing? Is the block/grid configuration fine?
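
The symptom above (only one thread's products surviving in C[0]) is what one would expect if threads j and j+4 of a block both execute the `+=` on the same element of C in the same iteration. That read-modify-write is not atomic, and the `__syncthreads()` placed after it does not order the two updates. Emulation mode runs the threads on the CPU, where the two updates may never overlap, which would likely hide the problem. Below is a minimal sketch of one possible way around it, assuming a fixed 4x4 size, row-major storage, the same (2,2) grid and (2,4) block, and C zeroed before launch. The names `sgemm_no_race`, `partial`, and `run_check` are hypothetical and do not come from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TYPE float
#define N 4   // matrix dimension assumed by this sketch

// Hypothetical variant of the posted kernel. Each thread accumulates its
// partial sum in a register; the two partial sums that belong to one C
// element are then combined through shared memory by a single thread.
__global__ void sgemm_no_race(const TYPE *A, const TYPE *B, TYPE *C)
{
    __shared__ TYPE partial[2 * N];                    // one slot per thread

    int i    = blockIdx.x * gridDim.y + blockIdx.y;    // row of C (block number)
    int j    = threadIdx.x * blockDim.y + threadIdx.y; // thread number, 0..7
    int jj   = j % N;                                  // column of C
    int half = j / N;                                  // 0 or 1: selects this thread's k values

    TYPE sum = 0;
    for (int k = 0; k < N / 2; ++k) {
        int kk = 2 * k + half;                         // same index scheme as the post
        sum += A[i * N + kk] * B[kk * N + jj];
    }

    partial[j] = sum;
    __syncthreads();          // every partial sum is visible past this point

    if (half == 0)            // exactly one writer per C element
        C[i * N + jj] += partial[jj] + partial[jj + N];
}

// Runs the kernel on small test matrices. Returns the number of elements
// that differ from a CPU reference, or -1 if a CUDA call failed.
int run_check()
{
    TYPE hA[N * N], hB[N * N], hC[N * N], ref[N * N];
    for (int x = 0; x < N * N; ++x) {
        hA[x] = (TYPE)(x + 1);     // small integers, so float sums are exact
        hB[x] = (TYPE)(x % 3);
        hC[x] = 0;
    }
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c) {
            ref[r * N + c] = 0;
            for (int k = 0; k < N; ++k)
                ref[r * N + c] += hA[r * N + k] * hB[k * N + c];
        }

    TYPE *dA = 0, *dB = 0, *dC = 0;
    size_t bytes = N * N * sizeof(TYPE);
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice);   // C starts at zero

    dim3 grid(2, 2), block(2, 4);
    sgemm_no_race<<<grid, block>>>(dA, dB, dC);
    cudaError_t err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int x = 0; x < N * N; ++x)
        if (hC[x] != ref[x]) ++bad;
    return bad;
}

int main()
{
    int bad = run_check();
    printf("mismatching elements: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

A simpler alternative would be to launch four threads per block and let each thread run the full k loop for its own C element, so that no two threads ever touch the same location. The shared-memory version is shown only because it keeps the launch configuration from the post.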
