performance problem with matrix calculation

Hi all,

I am trying to evaluate the Tesla C870’s performance for myself by computing the square of a matrix.

The calculation itself is not split across threads: each thread computes the square of the whole matrix. I simply launched 128 threads so that they (hopefully) all do this computation at the same time.

My problem appeared after I had checked that the application was giving correct results…

The computation time on the device is longer than the computation time on the host with the emulation build (make emu=1).

I first thought it was because not enough computation was being done on the device, so I modified the program to compute the cube of the same matrix to increase the amount of work. The problem remains… I can’t get better performance out of the Tesla!

The original matrix is usually 100×100 up to 500×500 cells

The pdA pointer points to this original matrix

The pdR pointer points to a memory region that is 128 times the size of the original matrix
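
For reference, here is a simplified sketch of the host-side setup and launch (the real code is in the attached .cu file; the host pointer pA, the 100×100 size and the choice of 128 blocks of one thread each are only illustrative here):

[codebox]
// simplified host side: allocate, copy, launch, read back the timings
int      sqrtlen = 100;                 // matrix dimension
int      matsize = sqrtlen * sqrtlen;   // number of cells
int      nblocks = 128;                 // each block redoes the whole computation

float*   pdA;      // original matrix on the device
float*   pdR;      // room for 128 result matrices (128 times the original size)
clock_t* pdtime;   // one clock() delta per block

cudaMalloc((void**)&pdA,    matsize * sizeof(float));
cudaMalloc((void**)&pdR,    nblocks * matsize * sizeof(float));
cudaMalloc((void**)&pdtime, nblocks * sizeof(clock_t));

// pA is the matrix already filled on the host
cudaMemcpy(pdA, pA, matsize * sizeof(float), cudaMemcpyHostToDevice);

// 128 blocks of one thread each, matching the blockIdx.x indexing in the kernel
MatrixMul<<<nblocks, 1>>>(pdA, pdR, matsize, pdtime);
cudaThreadSynchronize();

// read back the per-block timings (raw clock ticks)
clock_t htime[128];
cudaMemcpy(htime, pdtime, nblocks * sizeof(clock_t), cudaMemcpyDeviceToHost);
[/codebox]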

Here is the kernel function I wrote for the matrix cube calculation:

[codebox]
// kernel: each block computes the full A^3 on its own and writes it
// into its own slice of pdR
__global__ void MatrixMul(float* pdA, float* pdR, int matsize, clock_t* pdtime)
{
	//printf("%u\t", clock());

	// start time
	clock_t btime = clock();
	clock_t etime;

	// index
	int bx = blockIdx.x;

	// variables declaration
	float* pbuf1 = pdA;                  // input matrix A
	float* pbuf2 = pdR + bx * matsize;   // this block's result slice in pdR
	int sqrtlen = sqrtf(matsize);        // matrix dimension (matsize = sqrtlen * sqrtlen)
	int numl;                            // row of the current output cell
	int numc;                            // column of the current output cell
	int i;
	int j;

	// special declaration
	__shared__ float res[MATDIM * MATDIM];   // intermediate result A * A
	__shared__ float Res;                    // accumulator for one cell of A^3

	// first multiplication: res = A * A
	for (j = 0; j < matsize; ++j)
	{
		numl = j / sqrtlen;
		numc = j % sqrtlen;
		res[j] = 0;
		for (i = 0; i < sqrtlen; ++i)
		{
			res[j] += *(pbuf1 + i + (numl * sqrtlen)) * *(pbuf1 + (sqrtlen * i) + numc);
		}
	}

	// second multiplication: pbuf2 = res * A = A^3
	for (j = 0; j < matsize; ++j)
	{
		numl = j / sqrtlen;
		numc = j % sqrtlen;
		Res = 0;
		for (i = 0; i < sqrtlen; ++i)
		{
			Res += res[i + (numl * sqrtlen)] * *(pbuf1 + (sqrtlen * i) + numc);
		}
		*(pbuf2 + j) = Res;
	}

	// end time
	etime = clock();
	pdtime[bx] = etime - btime;   // elapsed clock ticks for this block
}
[/codebox]

I have also attached the full .cu file… (it should open with WordPad)

I’d be very grateful if someone could give me a clue as to what is slowing down the on-device multithreaded calculation so much.

thanks in advance,

Electro
ahessai10.rtf (6.55 KB)

I suggest you look at the sample in “NVIDIA_CUDA_Programming_Guide_2.0.pdf”, Chapter 6.
It shows how to treat the matrix as a 2D array and compute the result sub-matrix block by sub-matrix block.
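
Something along these lines — a minimal sketch of the tiled (sub-matrix) multiplication from the Guide; BLOCK_SIZE and the assumption that the dimension N is a multiple of BLOCK_SIZE (pad the matrix otherwise) are mine:

[codebox]
#define BLOCK_SIZE 16

// C = A * B for square N x N matrices, N assumed to be a multiple of BLOCK_SIZE
__global__ void MatMulTiled(const float* A, const float* B, float* C, int N)
{
	__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
	__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

	int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
	int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

	float acc = 0.0f;

	// walk over the tiles of A along the row and of B along the column
	for (int t = 0; t < N / BLOCK_SIZE; ++t)
	{
		// each thread loads one element of the current A tile and B tile
		As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
		Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
		__syncthreads();

		// multiply the two tiles out of shared memory
		for (int k = 0; k < BLOCK_SIZE; ++k)
			acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
		__syncthreads();
	}

	C[row * N + col] = acc;
}
[/codebox]

Launch it with dim3 threads(BLOCK_SIZE, BLOCK_SIZE) and dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE); every thread then computes one output cell, and the shared-memory tiles cut down the global memory traffic. For the cube, run it twice (first A*A into a temporary buffer, then that buffer times A).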