Hello,

I am doing a study of the performance of Matrix-Vector and Matrix-Matrix multiplication on an NVIDIA GPU and wanted to know if I am correctly computing the GFLOPs for my card, because I am getting LARGE values for simple Matrix-Vector multiplication. The strange GFLOPs appear somewhere between matrix dimensions of 2048x2048 and 4096x4096. For simplicity, the matrix is square and the vector is a column vector (no compression format is being used).

The code that I am using for Matrix-Vector follows:

…

unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));

for (unsigned int i = 0; i < MAX; i++) {
    // Start timer:
    cutilCheckError(cutStartTimer(timer));

    // Execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

    // Stop timer and accumulate total GPU time:
    cutilCheckError(cutStopTimer(timer));
    totalGPUTime += cutGetTimerValue(timer);
}

// Destroy timer:
cutilCheckError(cutDeleteTimer(timer));

totalGPUTime = totalGPUTime / (double)MAX;
printf("GPU Processing Time: %f (ms)\n", totalGPUTime);

double GFLOPs = (double)(2.0 * WA * HB * HB) / (double)(1024 * 1024 * 1024);
GFLOPs /= totalGPUTime;
printf("GPU GFlops = %f\n", GFLOPs);

…

MAX is 1; WA is the width of matrix A and HB is the height of vector B.

Also, any matrix dimension greater than 4584x4584 causes the program to crash. I assume this is because the number of threads exceeds what my GPU can handle?

The hardware configuration of my NVIDIA GPU follows:

# GPU NVIDIA:

GL Vendor: NVIDIA Corporation

GL Renderer: Quadro NVS 290/PCI/SSE2

GL Version: 3.0.0

Video Memory Installed: 256 MB

Interface Type: PCIe x16

Technology: DDR2 SDRAM 64-bit

Max. Resolution (external): 2560x1600 / 60 Hz

RAMDAC Clock Speed: 350 MHz

Driver Version:

ALU Instructions: 16384

TEX Instructions: 16384

TEX Indirections: 16384

MAX_TEXTURE_IMAGE_UNITS: 32

MAX_TEXTURE_COORDINATES: 32

Thanks for any hints/information :D