Hi all,

I am trying to measure the Tesla C870’s performance for myself by computing the square of a matrix.

The calculation itself is not parallelized: each thread computes the full square of the matrix. I launch 128 threads so that (hopefully) they all run at the same time.

The problem showed up after I had checked that the application gives correct results.

The computation takes longer on the device than on the host running the emulation build (make emu=1).

At first I thought the device was not being given enough work, so I modified the program to compute the cube of the same matrix and increase the amount of computation. The problem remains: I can’t get better performance on the Tesla!

The original matrix usually ranges from 100*100 to 500*500 cells.

The pdA pointer points to this original matrix.

The pdR pointer points to a memory region that is 128 times the size of the original matrix.
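
To make the sizes concrete, here is a minimal host-side sketch in plain C++. The helper names (kCopies, inputBytes, resultBytes, resultOffset) are hypothetical and are not taken from the attached file. It assumes a 4-byte float and that the 128 result copies are stored back to back in pdR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper names; only pdA, pdR and the factor 128 come from the post.
const std::size_t kCopies = 128;  // one result matrix per thread

// Bytes needed for one n*n matrix of floats (the buffer behind pdA).
std::size_t inputBytes(std::size_t n) {
    assert(n > 0);
    return n * n * sizeof(float);
}

// Bytes needed for the result buffer behind pdR: kCopies matrices.
std::size_t resultBytes(std::size_t n) {
    return kCopies * inputBytes(n);
}

// Element offset of result copy k inside pdR (assumption: copies are contiguous).
std::size_t resultOffset(std::size_t k, std::size_t n) {
    assert(k < kCopies);
    return k * n * n;
}
```

Under those assumptions, a 500*500 matrix needs 1,000,000 bytes for the input and 128,000,000 bytes for the results.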

Here is the kernel function I wrote for the matrix cube calculation:

```
// thread function
__global__ void MatrixMul(float* pdA, float* pdR, int matsize, clock_t* pdtime)
{
    //printf("%u\t",clock());
    // start time
    clock_t btime = clock();
    clock_t etime;
    // block index
    int bx = blockIdx.x;
    // variable declarations
    float* pbuf1 = pdA;
    float* pbuf2 = pdR;
    // matsize is the total number of cells, so sqrtlen is the side length
    int sqrtlen = sqrtf(matsize);
    int numl;   // row index
    int numc;   // column index
    int i;
    int j;
    // shared-memory declarations
    __shared__ float res[MATDIM*MATDIM];   // intermediate result A*A
    __shared__ float Res;                  // accumulator for one output cell
    // first matrix multiplication: res = A * A
    for (j = 0; j < matsize; ++j)
    {
        numl = j / sqrtlen;
        numc = j % sqrtlen;
        res[j] = 0;
        for (i = 0; i < sqrtlen; ++i)
        {
            res[j] += *(pbuf1 + i + (numl * sqrtlen)) * *(pbuf1 + (sqrtlen * i) + numc);
        }
    }
    // second matrix multiplication: result = res * A
    for (j = 0; j < matsize; ++j)
    {
        numl = j / sqrtlen;
        numc = j % sqrtlen;
        Res = 0;
        for (i = 0; i < sqrtlen; ++i)
        {
            Res += res[i + (numl * sqrtlen)] * *(pbuf1 + (sqrtlen * i) + numc);
        }
        *(pbuf2 + j) = Res;
    }
    // end time
    etime = clock();
    // elapsed clock ticks: end minus start
    pdtime[bx] = etime - btime;
}
```
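
For checking the results, the same two-pass computation can be written as a host-side reference in plain C++. This is only a sketch: matMul and matCube are hypothetical names that are not in the attached file, and it assumes a square matrix stored row by row in a flat array, as the kernel's indexing does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies two n*n row-major matrices: out = x * y.
// Uses the same flat indexing as the kernel (row = j / n, col = j % n).
std::vector<float> matMul(const std::vector<float>& x,
                          const std::vector<float>& y,
                          std::size_t n) {
    assert(x.size() == n * n && y.size() == n * n);
    std::vector<float> out(n * n, 0.0f);
    for (std::size_t j = 0; j < n * n; ++j) {
        std::size_t row = j / n;
        std::size_t col = j % n;
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            sum += x[row * n + i] * y[i * n + col];
        }
        out[j] = sum;
    }
    return out;
}

// Cube of a matrix in two passes: first A*A, then (A*A)*A.
std::vector<float> matCube(const std::vector<float>& a, std::size_t n) {
    return matMul(matMul(a, a, n), a, n);
}
```

Comparing the output of matCube with each of the 128 device results (within a small floating-point tolerance) is one way to confirm that the kernel output is correct.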

I have also attached the full .cu file (it should be opened with WordPad).

I’d be very grateful if someone could give me a clue as to what is slowing down the multithreaded on-device calculation so much.

Thanks in advance,

Electro

ahessai10.rtf (6.55 KB)