Some help needed with shared memory and program correctness in a matrix * vector operation

What I have set up this code to do (it works as intended with SHARED_MEM == 0, i.e. using just global memory) is to take a matrix and divide it into blocks of rows. Each block of the matrix is assigned to a CUDA Thread Block, and each row of that block is assigned to a thread of the Thread Block.

Shared memory is set up to store one block of rows of the matrix, and each Thread Block should have its own shared memory pool (Thread Blocks do not share shared memory).

The code produces correct results with shared memory “turned on” up until element #17 (TBLOCK = 16, ROWS = 2 * 16, COLS = 2 * 16): that is 2 Thread Blocks with 16 threads each, a 32x32 source matrix, a 32x1 source vector, and a 32x1 result vector.

Here is the code:

__global__ void MatTest(
	float *d_C,
	float *d_A,
	float *d_B)
{
	// Block index
	int bx = blockIdx.x;
	// Thread index
	int tx = threadIdx.x;

	int i = tx + (bx * blockDim.x);

	//printf ("tx: %d\n", tx);
	//printf ("bx: %d\n", bx);
	//printf ("i: %d\n", i);

#if SHARED_MEM == 1
	__shared__ float matA[TBLOCK * COLS];
	//__shared__ float vecB[ROWS];
	//__shared__ float vecC[ROWS];

	int c;
	for (c = 0; c < COLS; c++) {
		matA[indexC(tx,c, COLS)] = d_A[indexC(i,c,COLS)];
		//vecB[c] = d_B[c];
		//vecC[c] = 0;
	}

	float t = 0;
	for (c = 0; c < COLS; c++) {
		t += matA[indexC(tx,c,COLS)] * d_B[c];
		//d_C[i] = 2;
	}

	d_C[i] = t;
	//this kernel uses  bytes of Shared Memory and  bit registers...
#endif

#if SHARED_MEM == 0
	for (int c = 0; c < COLS; c++) {
		d_C[i] += d_A[indexC(i,c,COLS)] * d_B[c];
	}
	//this kernel uses  bytes of Shared Memory and  bit registers...
#endif

	//printf("d_C[%d]: %f\n", i, d_C[i]);
}


I have been thinking it over quite a bit, but I cannot find any obvious error, so I’d gladly take some advice…

thanks in advance,



I am an idiot :P.

Executing in emudebug mode helped me see an out-of-bounds access issue, which in turn revealed that I was passing the wrong argument to the macro indexC (NCOLS instead of NROWS):

(corrected macros)

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) //row-major matrix
#define indexC(i ,j, n_rows) (((j) * (n_rows)) + (i)) //column major order addressing + 1st element has id#0

so instead of this:

matA[indexC(tx,c, TBLOCK)]…

I was doing…

matA[indexC(tx,c, COLS)] …

in the two loops.

Also… using COLS instead of ROWS was not causing a problem only because the matrix happens to be square…
nVIDIA, thanks for the emudebug mode :D.

Sorry for wasting your time with a dumb error :(…