Unexpected behaviour of matrix multiply demo

Hi -

I’ve written a demo of three matrix multiplication kernels: (1) straight reads from global memory, (2) shared memory, and (3) shared memory with coalesced reads (see attached).

Unfortunately:

(a) When I run it on my C2050 (whether or not it is compiled with the -arch=sm_20 flag), it crashes after the second cudaThreadSynchronize() call with the failure “ERROR: Sync2: unspecified launch failure”.

(b) When I run it on my GTX285 it runs without errors, but I get the wrong results! And the timings look backwards: multiplying two 400x400 matrices takes 0.003 s with kernel (1), 0.035 s with (2), and 0.101 s with (3).

This is all running on an up-to-date Ubuntu release with the latest CUDA drivers, compilers and libraries.

Does anybody know what I’m doing wrong?

Many thanks,

David
multiply_matrices_gpu.cu (5.13 KB)

Your shared memory accesses go out of bounds. You need to check your indices into the shared tiles:

__global__ void matrixMulShared(float *A_gpu, float *B_gpu, float *C_gpu, int rowLength) {
    // Assumes rowLength is a multiple of 16 (true for 400x400: 400 = 25 * 16).
    __shared__ float A_tile[16 * 16];
    __shared__ float B_tile[16 * 16];

    int i = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    float sum = 0.0f;

    for (int tileIdx = 0; tileIdx < rowLength / 16; tileIdx++) {
        A_tile[threadIdx.y * 16 + threadIdx.x] =
            A_gpu[i * rowLength + (tileIdx * 16 + threadIdx.x)];
        // The column of B must come from blockIdx.x, not blockIdx.y, or blocks
        // off the diagonal read the wrong columns. This load walks down a
        // column of B, so it is not coalesced.
        B_tile[threadIdx.x * 16 + threadIdx.y] =
            B_gpu[(tileIdx * 16 + threadIdx.x) * rowLength
                  + (blockIdx.x * blockDim.x + threadIdx.y)];
        __syncthreads();

        // The inner loop must stop at the 16-wide tile edge; letting k run
        // past 16 is what pushed the shared-memory indices out of bounds.
        for (int k = 0; k < 16; k++) {
            sum += A_tile[threadIdx.y * 16 + k] * B_tile[k * 16 + threadIdx.x];
        }
        __syncthreads();
    }

    C_gpu[i * rowLength + j] = sum;
}
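If it helps to sanity-check the tiling arithmetic without a GPU, here is a small host-side sketch (my own, not from the attached .cu file; the function names are made up) that replays the same 16x16 tile indexing sequentially and compares it against a naive triple loop. An out-of-range tile index shows up immediately as a wrong element rather than a launch failure.

```cpp
#include <vector>

// Illustrative host-side replay of the 16x16 tiled indexing.
// Assumes n is a multiple of TILE, as in the 400x400 demo (400 = 25 * 16).
constexpr int TILE = 16;

std::vector<float> naiveMulHost(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

std::vector<float> tiledMulHost(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    std::vector<float> A_tile(TILE * TILE), B_tile(TILE * TILE);
    for (int by = 0; by < n / TILE; ++by)        // plays blockIdx.y
      for (int bx = 0; bx < n / TILE; ++bx)      // plays blockIdx.x
        for (int t = 0; t < n / TILE; ++t) {     // plays tileIdx
            // Cooperative loads: each (ty, tx) "thread" fills one element.
            for (int ty = 0; ty < TILE; ++ty)
              for (int tx = 0; tx < TILE; ++tx) {
                A_tile[ty * TILE + tx] = A[(by * TILE + ty) * n + (t * TILE + tx)];
                // B is read down a column: correct, but not coalesced on the GPU.
                B_tile[tx * TILE + ty] = B[(t * TILE + tx) * n + (bx * TILE + ty)];
              }
            // Inner product over the 16-wide tile only (the k < 16 bound).
            for (int ty = 0; ty < TILE; ++ty)
              for (int tx = 0; tx < TILE; ++tx) {
                float sum = 0.0f;
                for (int k = 0; k < TILE; ++k)
                    sum += A_tile[ty * TILE + k] * B_tile[k * TILE + tx];
                C[(by * TILE + ty) * n + (bx * TILE + tx)] += sum;
              }
        }
    return C;
}
```

With small integer-valued inputs both versions produce bit-identical results, so any indexing slip in the tiled variant stands out at once.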

Many thanks for the suggestion. However, this doesn’t fix either problem.

Best, David

OK, I’ve fixed it now. I had missed your modification to the line “for (int k = 0; k<16; k++){”.

Many thanks for your help,

David
