peak computational throughput

Dear all,
How can I find the peak computational throughput (GFLOP/s) of my installed GPU (GeForce GT 740M)?

Essentially all of the data you need to calculate that can be obtained from deviceQuery. I suggest posting the deviceQuery output from your GT 740M.

Using the FFMA instruction, a single CUDA core can perform 2 floating-point operations per clock cycle: 1 multiply and 1 add. So peak FLOPS is just the number of CUDA cores times the clock frequency times 2. On Titan X you have:

3072 cores * 2 FLOPs/cycle * 1000 * 10**6 cycles/s (1000 MHz base clock) = 6.144 * 10**12 FLOPS = 6.144 TFLOPS

The boost clock will let you compute more than that, but it typically can’t be sustained because of power or heat constraints.

I believe the 740M is a Kepler part with 384 cores at 810 MHz. That’s 384 * 2 * 810 MHz = ~622 GFLOPS. On Kepler some of the CUDA cores are shared between schedulers, and in practice utilizing more than ~70% of them at any one time is not possible. So you should see sgemm benchmarks run at around 430 GFLOPS. The sgemm implementation in cuBLAS can issue FFMAs just about as fast as the cores can process them (provided the matrix dimensions are big enough).
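To make the arithmetic above easy to repeat for other cards, here is a minimal Python sketch. The function name `peak_sp_gflops` is made up for illustration, the core counts and clocks are the ones quoted in this thread (check them against your own deviceQuery output), and the 0.70 factor is only the rough Kepler ceiling mentioned above, not a measured value.

```python
def peak_sp_gflops(cuda_cores: int, clock_mhz: float) -> float:
    """Peak single-precision GFLOP/s, assuming one FFMA (2 FLOPs) per core per cycle."""
    flops_per_cycle = 2  # one multiply plus one add per FFMA
    return cuda_cores * flops_per_cycle * clock_mhz / 1000.0  # MFLOP/s -> GFLOP/s


# Figures quoted in this thread (assumed, not read from the hardware).
titan_x = peak_sp_gflops(3072, 1000)  # base clock, not boost clock
gt_740m = peak_sp_gflops(384, 810)

# Rough sustained sgemm estimate on Kepler, using the ~70% rule of thumb.
gt_740m_sgemm = 0.70 * gt_740m

print(f"Titan X peak:       {titan_x:.0f} GFLOP/s")
print(f"GT 740M peak:       {gt_740m:.0f} GFLOP/s")
print(f"GT 740M sgemm est.: {gt_740m_sgemm:.0f} GFLOP/s")
```

Swap in the core count and clock reported for your own card. Using the boost clock instead of the base clock gives a higher number that usually cannot be sustained.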

There is also a corresponding divisor if you are referring to double-precision GFLOPS, and it varies by GPU. On Titan X the divisor is 32 (6.144 TFLOPS / 32 = 192 DP GFLOPS), and on the 740M, if it is a Kepler part, the divisor should be 24 (622 / 24 = ~25.9 DP GFLOPS).
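The double-precision estimate is the same calculation with one extra division. A small Python sketch follows. The helper name `peak_dp_gflops` is hypothetical, and the divisors (32 and 24) are the ones stated in this thread, so confirm the ratio for your own architecture before relying on it.

```python
def peak_dp_gflops(sp_gflops: float, dp_divisor: int) -> float:
    """Peak double-precision GFLOP/s from the single-precision peak and the SP:DP ratio."""
    return sp_gflops / dp_divisor


# Single-precision peaks and divisors as quoted in this thread (assumed values).
titan_x_dp = peak_dp_gflops(6144.0, 32)
gt_740m_dp = peak_dp_gflops(622.08, 24)

print(f"Titan X DP peak: {titan_x_dp:.1f} GFLOP/s")
print(f"GT 740M DP peak: {gt_740m_dp:.1f} GFLOP/s")
```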