Hello,

I’m analysing the performance of GPUs for my diploma thesis.

I’m having some trouble with the information given in the performance guidelines of the CUDA Programming Guide.

It says that every multiprocessor performs 8 float multiplications per clock cycle. To me this means it executes 1 multiplication on a warp every 4 clocks, right?

It further says that a multiprocessor performs 0.88 float divisions per clock cycle. So it takes about 37 clocks to execute 1 division on a warp?!

This would mean that a multiplication is almost 10 times (37/4 ≈ 9) faster than a division.

So I implemented these two little test kernels:

```
__global__ void
calcMultiKernel(float* g_odata, float multi)
{
    // linear thread id across the 2D grid
    const unsigned int tid = blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y) + threadIdx.x;
    float a = g_odata[tid];
    for (int i = 0; i < 1000000; i++)
        a *= multi;
    g_odata[tid] = a;
}

__global__ void
calcDivKernel(float* g_odata, float divider)
{
    // linear thread id across the 2D grid
    const unsigned int tid = blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y) + threadIdx.x;
    float a = g_odata[tid];
    for (int i = 0; i < 1000000; i++)
        a /= divider;
    g_odata[tid] = a;
}
```
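To time the kernels I use CUDA events, roughly like this (a simplified sketch assuming the kernels above; the block/grid configuration and the multiplier value here are just examples, not my exact test setup):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int threads = 256, blocks = 4;   // example configuration only
    const int n = threads * blocks;

    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time the multiplication kernel
    cudaEventRecord(start);
    calcMultiKernel<<<blocks, threads>>>(d_data, 1.000001f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msMul;
    cudaEventElapsedTime(&msMul, start, stop);

    // time the division kernel
    cudaEventRecord(start);
    calcDivKernel<<<blocks, threads>>>(d_data, 1.000001f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msDiv;
    cudaEventElapsedTime(&msDiv, start, stop);

    printf("mul: %f ms, div: %f ms, ratio: %f\n", msMul, msDiv, msDiv / msMul);

    cudaFree(d_data);
    return 0;
}
```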

It’s just meant to verify that the numbers in the guide are right.

I multiply and divide 1 million times, so the impact of memory latency should be negligible.

I tried this on a GTX 280 and a 9500 GT, with thread counts ranging from 1 to 2000, and varied the iteration count from 1 million to 100 million. But the results always show that the division only needs about 40% more time than a multiplication.

The same problem appears with sin/cos, log and sqrt. The sin(32) call should be 8 times slower, but is only about 10% slower. sqrt(16) and log(16) are only 25-30% slower, but should be around 400% slower.

What is the problem? Did I make a mistake, or is the information in the guide wrong?

Thx