Computing GFLOPs

Hello,

I am doing a study of the performance of Matrix-Vector and Matrix-Matrix multiplication on an NVIDIA GPU and would like to know whether I am computing the GFLOPs for my card correctly, because I am getting very large values for simple Matrix-Vector multiplication. The strange GFLOPs appear somewhere between matrix dimensions of 2048x2048 and 4096x4096. For simplicity, the matrix is square and the vector is a column vector (no compression format is being used).

The code that I am using for Matrix-Vector follows:

unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));
for(unsigned int i = 0; i < MAX; i++){
    // Start Timer:
    cutilCheckError(cutStartTimer(timer));
    // Execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
    // Stop Timer and Collect current TOTAL GPU Time:
    cutilCheckError(cutStopTimer(timer));
    totalGPUTime += cutGetTimerValue(timer);
}

// Destroy Timer:
cutilCheckError(cutDeleteTimer(timer));
totalGPUTime = totalGPUTime/(double)MAX;
printf("GPU Processing Time: %f (ms)\n", totalGPUTime);
double GFLOPs = (double)(2.0*WA*HB*HB)/(double)(1024*1024*1024);
GFLOPs /= totalGPUTime;

printf("GPU GFlops = %f\n", GFLOPs);

MAX is 1; WA and HB are the width of matrix A and the height of vector B, respectively.

Also, any matrix dimension greater than 4584x4584 causes the program to crash. I assume this is because the number of threads exceeds what my GPU can handle?
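
One way to check whether the launch configuration (rather than, say, a failed allocation) is the problem would be to query the device limits and test the launch result explicitly. A minimal sketch, assuming the same grid/threads variables used in the launch above:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("Max grid size: %d x %d x %d\n",
       prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
printf("Global memory: %lu bytes\n", (unsigned long)prop.totalGlobalMem);

matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
cudaError_t err = cudaGetLastError();              // catches invalid launch configurations
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));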

The hardware configuration of my NVIDIA GPU follows:

GPU NVIDIA:

GL Vendor: NVIDIA Corporation
GL Renderer: Quadro NVS 290/PCI/SSE2
GL Version: 3.0.0
Video Memory Installed: 256 MB
Interface Type: PCIe x16
Technology: DDR2 SDRAM 64-bit
Max. Resolution (external): 2560x1600 / 60 Hz
RAMDAC Clock Speed: 350 MHz
Driver Version:
ALU Instructions: 16384
TEX Instructions: 16384
TEX Indirections: 16384
MAX_TEXTURE_IMAGE_UNITS: 32
MAX_TEXTURE_COORDINATES: 32

Thanks for any hints/information :D

Try the following code to measure elapsed time:

unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));

// Start Timer:
cutilCheckError(cutStartTimer(timer));

for(unsigned int i = 0; i < MAX; i++){
    // Execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
}

// Wait for all kernels to finish before stopping the timer:
cudaThreadSynchronize();

// Stop Timer and Collect current TOTAL GPU Time:
cutilCheckError(cutStopTimer(timer));
totalGPUTime = cutGetTimerValue(timer);

// Destroy Timer:
cutilCheckError(cutDeleteTimer(timer));

totalGPUTime = totalGPUTime/(double)MAX;
printf("GPU Processing Time: %f (ms)\n", totalGPUTime);

double GFLOPs = (double)(1.0*WA*HB*HB)/(double)(1024*1024*1024);
GFLOPs /= totalGPUTime / 1000.0;                   // timer value is in milliseconds, hence the /1000
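
CUDA events are another way to time the kernels on the GPU itself, independent of the cutil host timer. A minimal sketch, assuming the same kernel and launch parameters as above:

cudaEvent_t start, stop;
float elapsedMs = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for(unsigned int i = 0; i < MAX; i++){
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                        // wait until the kernels have finished
cudaEventElapsedTime(&elapsedMs, start, stop);     // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);

totalGPUTime = elapsedMs/(double)MAX;              // average time per launch, in ms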

Remark: the total number of MAD (multiply-and-add) operations is N^3, not 2*N^3; counting each multiply and its paired addition as a single MAD halves the count you get from counting individual floating-point operations.

How do you set up "grid" and "threads"?
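
For comparison, a typical launch configuration for this kind of tiled kernel looks like the following (a sketch only; BLOCK_SIZE and HA, the height of A, are assumptions, since the actual values were not posted):

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);              // e.g. BLOCK_SIZE = 16 -> 256 threads per block
dim3 grid(WB / threads.x, HA / threads.y);         // one block per BLOCK_SIZE x BLOCK_SIZE tile of C
                                                   // (assumes the dimensions are multiples of BLOCK_SIZE)
matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);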