this is output of device query

```
Device 0: "GeForce 9500 GT"
CUDA Driver Version: 3.10
CUDA Runtime Version: 3.10
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.38 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Vers ion = 3.10, NumDevs = 1, Device = GeForce 9500 GT
PASSED
```

something wrong I note in original kernel version of SDK matric multiplication.

If size A = 320x640 and B= 640x320 the grid size will be 20x20, the test of correctness pass, and I have 20 gflops !!

But if size A=640x540 and B 640x320 the grid is not square and results are wrong, test Fail !!

My version of this kernel work fine of not square grid too, but it’s more slow. Why ?

below original kernel in cuda sdk, the difference are only in indexing of threads…

```
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;
// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep) {
// Declaration of the shared memory array As used to
// store the sub-matrix of A
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
// Declaration of the shared memory array Bs used to
// store the sub-matrix of B
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load the matrices from device memory
// to shared memory; each thread loads
// one element of each matrix
AS(ty,tx) = A[a + wA * ty + tx];
BS(ty,tx) = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty,k) * BS(k,tx);
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
}
```

this is my version

```
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int aBegin = wA * BLOCK_SIZE * bx;
int aEnd = aBegin + wA - 1;
int aStep = BLOCK_SIZE;
int bBegin = BLOCK_SIZE * by;
int bStep = BLOCK_SIZE * wB;
float Csub = 0;
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep) {
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
AS(tx,ty)=A[a + wA * tx + ty];
BS(tx,ty) = B[b + wB * tx + ty];
__syncthreads();
for (int k = 0; k < BLOCK_SIZE; k++)
Csub += AS(tx,k) * BS(k,ty);
__syncthreads();
}
int c=wB * BLOCK_SIZE * bx + BLOCK_SIZE*by;
C[c + wB * tx + ty] = Csub;
}
```