Dear CUDA developers,

I have just started programming in CUDA, and I'm stuck on a very simple matrix multiplication code. The problem is that the results in emulation mode and on the device are different. What should I do? Any help will be appreciated.

Thank you very much.

Host function:

```
void Matrix::MulDevice(const Matrix& A, const Matrix& B)
{
    // One block of A.width x B.height threads. Everything runs in a single
    // block, so the matrices must fit within the per-block thread limit.
    dim3 dimBlock(A.width, B.height);
    dim3 dimGrid(1, 1);
#ifdef __DEVICE_EMULATION__
    printf("beginning multiplication\n");
#endif
    MatrixMulKernelV1<<<dimGrid, dimBlock>>>(this, &A, &B);
#ifdef __DEVICE_EMULATION__
    printf("end multiplication\n");
#endif
}
```

Kernel function:

```
__global__ void MatrixMulKernelV1(Matrix* R, const Matrix* A, const Matrix* B)
{
    // Each thread computes one element of R, at position (tx, ty).
    int tx = threadIdx.x;
    int ty = threadIdx.y;
#ifdef __DEVICE_EMULATION__
    printf("thread %d,%d \n", tx, ty);
#endif
    // Accumulate the dot product over the shared dimension k.
    float pValue = 0.0f;
    for (int k = 0; k < A->width; k++)
    {
        float AElement = MatrixGetElmt(A, k, ty);
        float BElement = MatrixGetElmt(B, tx, k);
        pValue += AElement * BElement;
    }
    MatrixSetElmt(R, tx, ty, pValue);
}
```
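
To tell which of the two results is the wrong one, it may help to compare both against a plain CPU loop. Below is a minimal sketch of such a check. The `Matrix` class itself is not shown in this post, so `HostMatrix`, `MulReference` and `MaxAbsDiff` are hypothetical names, and the row-major layout is an assumption; the loop mirrors the index order used in the kernel calls above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Matrix class, which is not shown in the post.
// Assumption: row-major storage, element (x, y) lives at data[y * width + x].
struct HostMatrix {
    int width;
    int height;
    std::vector<float> data;

    HostMatrix(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0.0f) {}

    float get(int x, int y) const {
        return data[static_cast<std::size_t>(y) * width + x];
    }
    void set(int x, int y, float v) {
        data[static_cast<std::size_t>(y) * width + x] = v;
    }
};

// Plain CPU product R = A * B. The (x, y) argument order mirrors the
// MatrixGetElmt / MatrixSetElmt calls in the kernel.
HostMatrix MulReference(const HostMatrix& A, const HostMatrix& B) {
    HostMatrix R(B.width, A.height);
    for (int y = 0; y < R.height; ++y) {
        for (int x = 0; x < R.width; ++x) {
            float sum = 0.0f;
            for (int k = 0; k < A.width; ++k) {
                sum += A.get(k, y) * B.get(x, k);
            }
            R.set(x, y, sum);
        }
    }
    return R;
}

// Largest absolute element-wise difference between two same-sized matrices.
float MaxAbsDiff(const HostMatrix& P, const HostMatrix& Q) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < P.data.size(); ++i) {
        worst = std::fmax(worst, std::fabs(P.data[i] - Q.data[i]));
    }
    return worst;
}
```

Copying each result back into a `HostMatrix` and calling `MaxAbsDiff` against `MulReference(A, B)` would show whether the emulation run or the device run deviates. Differences on the order of float rounding are expected between the two; anything larger points to a real bug.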