Release vs. emulation release comparison


I’ve written a simple program that calls the cublasSgemm() function and times the computation in order to measure the effective processing power.
To compare the processing power of the Tesla card with that of the host computer, I built the program without any flag for the former and with the “emu=1” flag for the latter.

Depending on the size of the input matrix I pass to cublasSgemm, I get a huge difference between the release build and the emulation release build. (I use only one matrix as input to the function in order to reduce transfers between the device and the host.)

The difference in processing power is so huge (170 Gflops for the Tesla card versus 16 Mflops for the single host core used, for a 1600 x 1600 input matrix) that I wonder whether the comparison makes sense…

Does anyone have an idea about that?

Does anyone know of a program that measures the processing power? I’m a bit frustrated with 170 Gflops, even though this value is computed from the number of operations I would have to perform to obtain the same result, so address calculations (and anything else) are not counted in this value.
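
For reference, the arithmetic behind such a figure can be sketched as follows. This is not the program from the post: it assumes the usual convention of counting 2·N³ operations for an N x N product (one multiply and one add per inner-loop step), and the function names are made up for illustration.

```python
def sgemm_flops(n: int) -> int:
    """Floating-point operations counted for an n x n by n x n product.

    Each of the n*n output elements takes n multiplies and about n adds,
    so the usual convention is 2 * n**3 (address arithmetic is not counted).
    """
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Effective rate in GFLOPS for one n x n SGEMM that took `seconds`."""
    return sgemm_flops(n) / seconds / 1e9


def seconds_at(n: int, rate_flops: float) -> float:
    """Run time implied by a sustained rate given in FLOPS."""
    return sgemm_flops(n) / rate_flops


if __name__ == "__main__":
    n = 1600
    print(sgemm_flops(n))        # 8192000000 operations
    print(seconds_at(n, 170e9))  # roughly 0.048 s at 170 Gflops
    print(seconds_at(n, 16e6))   # 512.0 s at 16 Mflops
```

Under that assumption, the two figures quoted above correspond to about 0.05 s on the card and about 8.5 minutes in emulation for the same 1600 x 1600 product.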

Thanks in advance,


No, the comparison does not make sense. Host emulation is a very slow way of doing an Sgemm. To get a fair comparison, you should compare with the MKL, for example.
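
A rough way to get a host baseline of that kind, as a minimal sketch: it assumes NumPy is installed and linked against a tuned BLAS (MKL or another one), in which case a single-precision matrix product is handed to that library's sgemm. The helper names are hypothetical, and the 2·N³ operation count is the usual convention rather than something stated in this thread.

```python
import time


def best_time(fn, repeats=3):
    """Return the shortest wall-clock time of `repeats` calls to fn()."""
    fn()  # warm-up call, not timed (the first call may pay one-off setup costs)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def host_sgemm_gflops(n=1600, repeats=3):
    """Time a single-precision n x n product on the host through NumPy."""
    import numpy as np  # assumption: NumPy is available and uses a tuned BLAS

    a = np.random.rand(n, n).astype(np.float32)
    # One input matrix multiplied by itself, as described in the first post.
    elapsed = best_time(lambda: a @ a, repeats)
    return 2 * n ** 3 / elapsed / 1e9
```

Calling host_sgemm_gflops(1600) returns the host rate in Gflops, which can then be set against the figure measured on the card instead of the emulation number.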

Thanks, but I’m afraid I have no idea what MKL is…
Is it a program?

First hit on Google: …/eng/307757.htm

Thanks for the information !!! :thumbup: