Performance Counters (Flop Counting)

Hi everyone,

A quick question: I just finished coding a matrix-matrix multiplication in CUDA, and now I would like to count the FLOPs to get some idea of the performance of my implementation. Is there any tool that allows me to do that?

Thanks a lot!

What about measuring time? With CUDA events you can measure the execution time of a kernel on the GPU with a resolution of around half a microsecond.

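cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);   // enqueue the start marker on stream 0
// LAUNCH KERNEL
cudaEventRecord(stop, 0);    // enqueue the stop marker after the kernel
cudaEventSynchronize(stop);  // block the host until the stop event has completed

float et;
cudaEventElapsedTime(&et, start, stop);  // time between the two events, in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);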

et contains the elapsed time in milliseconds. For a fixed problem size, a lower value means a faster implementation :)