Confusion about performance guide information


I’m analysing the performance of GPUs for my diploma thesis.

I have some problems with the information given in the performance guidelines of the CUDA Programming Guide.

It says that every multiprocessor performs 8 float multiplications every 32 clock cycles. To me this means it executes 1 multiplication on a warp per 4 clocks, right?

It further says that a multiprocessor performs 0.88 float divisions per 32 clock cycles. So it takes about 37 clocks to execute 1 division on a warp?!

To me this would mean that a multiplication is almost 10 (37/4) times faster than a division.
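
The arithmetic behind these numbers can be written down in a few lines. This is only a sketch of the reading used above: it assumes a warp of 32 threads and simply divides 32 by the figure quoted from the guide. The function names `clocksPerWarpOp` and `slowdown` are made up for this illustration and do not come from the guide.

```cpp
#include <cassert>

// Clocks a multiprocessor needs to run one operation for a whole warp,
// under the reading used above: 32 divided by the figure quoted from the guide.
// The function name and the constant are illustrative assumptions.
double clocksPerWarpOp(double throughputFigure) {
    assert(throughputFigure > 0.0);
    const double warpSize = 32.0;
    return warpSize / throughputFigure;
}

// How many times slower the "slow" operation is than the "fast" one
// by the same reading.
double slowdown(double fastFigure, double slowFigure) {
    return clocksPerWarpOp(slowFigure) / clocksPerWarpOp(fastFigure);
}
```

With the figures above, `clocksPerWarpOp(8.0)` gives 4 clocks, `clocksPerWarpOp(0.88)` gives roughly 36.4 clocks, and `slowdown(8.0, 0.88)` gives a factor of roughly 9.1.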

So I implemented these two small test kernels:

__global__ void
calcMultiKernel( float* g_odata, float multi)
{
	// access thread id
	const unsigned int tid = blockDim.x*(blockIdx.x+gridDim.x*blockIdx.y) + threadIdx.x;

	float a = g_odata[tid];

	// long chain of dependent multiplications, so memory latency is negligible
	for(int i=0;i<1000000;i++)
		a *= multi;

	g_odata[tid] = a;
}

__global__ void
calcDivKernel( float* g_odata, float divider)
{
	// access thread id
	const unsigned int tid = blockDim.x*(blockIdx.x+gridDim.x*blockIdx.y) + threadIdx.x;

	float a = g_odata[tid];

	// same loop as above, with a division instead of a multiplication
	for(int i=0;i<1000000;i++)
		a /= divider;

	g_odata[tid] = a;
}


They are only meant to check that the information given in the guide is right.

I multiply and divide 1 million times, so that the impact of the memory latency should become very low.

I tried this on a GTX 280 and a 9500 GT with a range from 1 to 2000 threads and varied the iteration count from 1 million to 100 million. But the results always show that the division only needs 40% more time than a multiplication.

The same problem appears with sin/cos, log and sqrt. The sin function (32 clocks) should be 8 times slower than a multiplication, but is only 10% slower. Sqrt and log (16 clocks each) are only 25-30% slower, but should take about 4 times as long.

What is the problem? Did I make a mistake, or is the information in the guide wrong?


I would suspect that the branching of the for loop (incrementing integer i and checking it against 1000000) would also add to the timing, unless the loop is completely unrolled, which I wouldn’t recommend for millions of iterations :)


This is where pipelining comes into effect. I had exactly the same doubt as you. I posted it here and figured this out…

The pipeline makes sure that successive multiplication and division operations from 4 threads (8 cores execute the 32 threads of a warp, so each core handles 4 threads) flow through the pipeline and hence get overlapped.

The more active threads you have, the better the latencies are hidden.

You should try with 1 thread and 1 block and then profile. Make sure you do it both with a RAW hazard and without the hazard, and see how well the pipelining works.
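
To make "with and without a RAW hazard" a bit more concrete, here is a rough sketch of the two loop shapes. It is written as plain C++ so it can be tried on the host; the same loop bodies could be placed inside a kernel. The function names and the choice of four independent accumulators are my own assumptions for illustration.

```cpp
// With a RAW (read-after-write) hazard: every multiply needs the result
// of the previous one, so the operations form one dependent chain.
float mulDependent(float a, float m, int iterations) {
    for (int i = 0; i < iterations; ++i)
        a *= m;            // reads the 'a' written in the previous iteration
    return a;
}

// Without a hazard between neighbouring multiplies: four independent
// accumulators, so consecutive multiplies do not wait on each other.
// The four chains are summed at the end only to keep all of them live.
float mulIndependent(float a, float m, int iterations) {
    float a0 = a, a1 = a, a2 = a, a3 = a;
    for (int i = 0; i < iterations; ++i) {
        a0 *= m;
        a1 *= m;
        a2 *= m;
        a3 *= m;
    }
    return a0 + a1 + a2 + a3;
}
```

Note that the second version performs four multiplications per iteration, so when comparing timings the time per multiplication is the total time divided by 4 times the iteration count.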

Thx for your reply.

What do you mean by branching?

The loop time could be a point. The comparison and the addition should take 4 clocks each, so the loop adds 8 clocks per iteration. The multiplication is then only 0.33 of the whole time per iteration (4 of 12 clocks), while the division should be 0.82 of the whole time (37 of 45 clocks). I must think about how this impacts the result. Thx.
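
The effect of this overhead on the expected ratio can be sketched as follows. The clock counts are the ones assumed in this thread (4 for a multiplication, 37 for a division, 8 for increment plus comparison); the function names are hypothetical.

```cpp
#include <cassert>

// Fraction of one loop iteration spent in the arithmetic operation itself,
// given the clocks for the operation and for the loop bookkeeping
// (counter increment plus comparison).
double opShare(double opClocks, double loopClocks) {
    assert(opClocks + loopClocks > 0.0);
    return opClocks / (opClocks + loopClocks);
}

// Expected ratio of the two kernels' run times when both pay the same
// loop overhead in every iteration.
double expectedTimeRatio(double slowOpClocks, double fastOpClocks, double loopClocks) {
    assert(fastOpClocks + loopClocks > 0.0);
    return (slowOpClocks + loopClocks) / (fastOpClocks + loopClocks);
}
```

With these assumptions `opShare(4, 8)` is about 0.33, `opShare(37, 8)` is about 0.82, and `expectedTimeRatio(37, 4, 8)` is 45/12 = 3.75. So even if the guide's figures hold, the division kernel would be expected to take roughly 3.75 times as long as the multiplication kernel, not 9 times.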

Sorry, I had problems with the internet, so I didn’t see your post.

Where can I see the occupancy of the pipes?

Another question:

Is every loop a kind of branching as described in the programming guide, where a multiprocessor follows each branch separately and deactivates all threads that are not following the same branch? I mean, is every loop such a branch, or only those where the threads do different numbers of iterations? I thought that as long as every thread runs the same loop, there is no branching.

Branching was a poor choice of words on my part. When testing against the iteration count, each thread can either move on to the next iteration of the loop or exit the loop, depending on the result of the test; that’s why I referred to it as a branch. But in your case the branches are not divergent and all threads execute the same number of iterations. I was merely pointing out that the operations associated with incrementing the counter and the conditional test would count against your timing results.


Thx again.

I’m now using the visual profiler to analyse my kernels. It says that my kernels execute exactly as many branches as iterations, so it seems to count each loop iteration as a branch. But it distinguishes between branches and divergent branches, and I’m not sure about the difference. As I see it, divergent branches are the type of branch I mentioned before. But what are “normal” branches? Do they have an effect on performance? Should they be avoided?


I tried to figure out the time needed by the loop itself. When I calculate the real time needed by the function without the loop, the ratio is somewhere around 4. So for now multiplication is 4 times faster… still missing some performance.
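
A correction of this kind can be sketched as below, assuming the time of the loop bookkeeping alone is known in the same unit as the two kernel timings. How that loop time is obtained is left open here, and the function name is hypothetical.

```cpp
#include <cassert>

// Ratio of the pure operation times after removing the shared loop overhead.
// All three inputs are run times in the same unit; loopTime is the time of
// the loop bookkeeping alone.
double overheadFreeRatio(double divTime, double mulTime, double loopTime) {
    assert(mulTime > loopTime);   // otherwise the correction makes no sense
    return (divTime - loopTime) / (mulTime - loopTime);
}
```

As an example with made-up numbers chosen only to show the arithmetic (they are not measurements): `overheadFreeRatio(24.0, 12.0, 8.0)` gives 4, although the uncorrected ratio 24/12 is only 2.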

Divergent branches are branches where different threads in the same (half-)warp take different paths of execution. When your iteration count is not data-dependent, such as a fixed iteration count, you can avoid the branches by unrolling the loop, either with a #pragma unroll statement or by unrolling it manually. Note, however, that you can’t fully unroll a loop with this many iterations, as it would greatly expand the size of the generated cubin.
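
As a middle ground, a loop can be unrolled partially. The sketch below shows manual unrolling by a factor of 4 in plain C++ (the same loop body would work inside a kernel); the function names and the factor 4 are arbitrary choices for illustration. In a kernel, placing `#pragma unroll 4` in front of the original loop asks the compiler for a similar transformation.

```cpp
// Rolled loop: one counter increment and one comparison per multiplication.
float mulRolled(float a, float m, int iterations) {
    for (int i = 0; i < iterations; ++i)
        a *= m;
    return a;
}

// Manually unrolled by 4: one increment and one comparison per four
// multiplications; a short tail loop handles counts not divisible by 4.
float mulUnrolled4(float a, float m, int iterations) {
    int i = 0;
    for (; i + 3 < iterations; i += 4) {
        a *= m;
        a *= m;
        a *= m;
        a *= m;
    }
    for (; i < iterations; ++i)   // remaining 0 to 3 iterations
        a *= m;
    return a;
}
```

Both functions perform the multiplications in the same order, so they return the same value; only the amount of loop bookkeeping per multiplication differs, and the code size grows by the unroll factor rather than by the iteration count.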