Unexpected behaviour of matrix multiply demo

Hi -

I’ve written a demo of three matrix multiplication kernels: (1) straight reads from global memory, (2) shared memory, and (3) shared memory with coalesced reads (see attached).

Unfortunately:

(a) When I run it on my C2050 (whether or not it is compiled with the -arch=sm_20 flag), it crashes after the second cudaThreadSynchronize() call with the failure “ERROR: Sync2: unspecified launch failure”.

(b) When I run it on my GTX285 it runs without errors, but I get the wrong results! And the timings look backwards: multiplying two 400x400 matrices takes 0.003 s with kernel (1), 0.035 s with (2), and 0.101 s with (3).

This is all running on an up-to-date Ubuntu release with the latest CUDA drivers, compilers and libraries.

Does anybody know what I’m doing wrong?

Many thanks,

David
multiply_matrices_gpu.cu (5.13 KB)

Your shared memory accesses go out of bounds. You need to check your indices into the shared tiles:

__global__ void matrixMulShared(float *A_gpu, float *B_gpu, float *C_gpu, int rowLength) {
    // Assumes rowLength is a multiple of 16 (true for 400x400: 400 = 25 * 16).
    __shared__ float A_tile[16 * 16];
    __shared__ float B_tile[16 * 16];

    int i = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    float sum = 0.0f;

    for (int tileIdx = 0; tileIdx < rowLength / 16; tileIdx++) {
        A_tile[threadIdx.y * 16 + threadIdx.x] =
            A_gpu[i * rowLength + (tileIdx * 16 + threadIdx.x)];
        // The column of B must come from blockIdx.x, not blockIdx.y, or blocks
        // off the diagonal read the wrong columns. This load walks down a
        // column of B, so it is not coalesced.
        B_tile[threadIdx.x * 16 + threadIdx.y] =
            B_gpu[(tileIdx * 16 + threadIdx.x) * rowLength
                  + (blockIdx.x * blockDim.x + threadIdx.y)];
        __syncthreads();

        // The inner loop must stop at the 16-wide tile edge; letting k run
        // past 16 is what pushed the shared-memory indices out of bounds.
        for (int k = 0; k < 16; k++) {
            sum += A_tile[threadIdx.y * 16 + k] * B_tile[k * 16 + threadIdx.x];
        }
        __syncthreads();
    }

    C_gpu[i * rowLength + j] = sum;
}
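If it helps to sanity-check the tiling arithmetic without a GPU, here is a small host-side sketch (my own, not from the attached .cu file; the function names are made up) that replays the same 16x16 tile indexing sequentially and compares it against a naive triple loop. An out-of-range tile index shows up immediately as a wrong element rather than a launch failure.

```cpp
#include <vector>

// Illustrative host-side replay of the 16x16 tiled indexing.
// Assumes n is a multiple of TILE, as in the 400x400 demo (400 = 25 * 16).
constexpr int TILE = 16;

std::vector<float> naiveMulHost(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

std::vector<float> tiledMulHost(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    std::vector<float> A_tile(TILE * TILE), B_tile(TILE * TILE);
    for (int by = 0; by < n / TILE; ++by)        // plays blockIdx.y
      for (int bx = 0; bx < n / TILE; ++bx)      // plays blockIdx.x
        for (int t = 0; t < n / TILE; ++t) {     // plays tileIdx
            // Cooperative loads: each (ty, tx) "thread" fills one element.
            for (int ty = 0; ty < TILE; ++ty)
              for (int tx = 0; tx < TILE; ++tx) {
                A_tile[ty * TILE + tx] = A[(by * TILE + ty) * n + (t * TILE + tx)];
                // B is read down a column: correct, but not coalesced on the GPU.
                B_tile[tx * TILE + ty] = B[(t * TILE + tx) * n + (bx * TILE + ty)];
              }
            // Inner product over the 16-wide tile only (the k < 16 bound).
            for (int ty = 0; ty < TILE; ++ty)
              for (int tx = 0; tx < TILE; ++tx) {
                float sum = 0.0f;
                for (int k = 0; k < TILE; ++k)
                    sum += A_tile[ty * TILE + k] * B_tile[k * TILE + tx];
                C[(by * TILE + ty) * n + (bx * TILE + tx)] += sum;
              }
        }
    return C;
}
```

With small integer-valued inputs both versions produce bit-identical results, so any indexing slip in the tiled variant stands out at once.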

Many thanks for the suggestion. However, this doesn’t fix either problem.

Best, David

OK, I’ve fixed it now. I had missed your modification to the line “for (int k = 0; k<16; k++){”.

Many thanks for your help,

David
