What is wrong with this picture? Linear time increase vs N

Is this supposed to be a flat horizontal line?


I ran a test with N = 10, 100, 1000, 10000 and 20000 to verify the parallelism achieved by my kernels. Not much, it seems.
The memory transfer time is the same in every case because I transfer the whole array every time. The GPU time shown is the execution time of one kernel only.
Are all the threads executed at the same time, at least up to 20,000 threads, on a GTX 260?

A typical occupancy analysis:

Kernel details : Grid size: 141 x 141, Block size: 300 x 1 x 1
Register Ratio = 0.6875 ( 11264 / 16384 ) [35 registers per thread]
Shared Memory Ratio = 0.03125 ( 512 / 16384 ) [32 bytes per Block]
Active Blocks per SM = 1 : 8
Active threads per SM = 300 : 1024
Occupancy = 0.3125 ( 10 / 32 )
Achieved occupancy = 0.3125 (on 27 SMs)
Occupancy limiting factor = Registers
Warning: Grid Size (19881) is not a multiple of available SMs (27).
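
For reference, the numbers in that report can be reproduced with a few lines of arithmetic. This is only a sketch: the warp size of 32 is standard, but the 512-register allocation granularity is an assumption chosen because it reproduces the reported 11264 registers per block, and the constant names are made up for illustration.

```python
import math

# Figures copied from the occupancy report above.
THREADS_PER_BLOCK = 300
REGS_PER_THREAD = 35
REGS_PER_SM = 16384
SMEM_PER_SM = 16384
SMEM_ALLOC_PER_BLOCK = 512
MAX_WARPS_PER_SM = 32
MAX_BLOCKS_PER_SM = 8
GRID = (141, 141)
NUM_SMS = 27

WARP_SIZE = 32
# Assumption: registers are allocated per block in units of 512.
REG_ALLOC_UNIT = 512

# 300 threads do not fill a whole number of warps, so round up.
warps_per_block = math.ceil(THREADS_PER_BLOCK / WARP_SIZE)
raw_regs = warps_per_block * WARP_SIZE * REGS_PER_THREAD
regs_per_block = math.ceil(raw_regs / REG_ALLOC_UNIT) * REG_ALLOC_UNIT

# How many blocks fit on one SM under each resource limit.
blocks_by_regs = REGS_PER_SM // regs_per_block
blocks_by_smem = SMEM_PER_SM // SMEM_ALLOC_PER_BLOCK
blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
active_blocks = min(blocks_by_regs, blocks_by_smem,
                    blocks_by_warps, MAX_BLOCKS_PER_SM)

occupancy = active_blocks * warps_per_block / MAX_WARPS_PER_SM
total_blocks = GRID[0] * GRID[1]

print("registers per block:", regs_per_block)
print("active blocks per SM:", active_blocks)
print("occupancy:", occupancy)
print("grid size:", total_blocks, "remainder mod SMs:", total_blocks % NUM_SMS)
```

Under these assumptions the register limit allows a single 300-thread block per SM, while the warp limit alone would allow three, which matches "Occupancy limiting factor = Registers".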


Perhaps you should give a little more context, otherwise it will be hard to tell you what’s wrong with your curve …


The context is clear: the kernel is supposed to calculate the same function N times.

The curve should therefore be quite flat, i.e. the same computation time for 10, 1,000 or up to 20,000 calculations, if everything is done in parallel. Or am I completely lost?

I did some more tests on my kernel.

As far as I can tell, the number of threads launched is equal to N, which I get by multiplying the “cta launched” number by the number of SMs (27 in my case).
What seems to be happening, with occupancy close to 1 in my cases, is that the number of branches and of divergent branches varies linearly with N (if N > 100000).
Although I have many divergent branches, no warp serialize count is reported.

Does anyone know what divergent branches and branch predication are all about?

No? You’re totally confusing the programming model with the hardware implementation.

192 or 224 threads run in parallel during any given shader clock cycle (depending on when you bought that GTX 260).

My GTX 260 has 216 scalar processors (27 streaming multiprocessors).

The cta_launched count in the profiler output corresponds to 3 SMs (TPC #0).

If I multiply cta_launched by 9 (27/3), I get the total number of parallel threads launched, yes?

I found that I had included a loop in the kernel which should have been outside all along. Now my program is much faster.

Ah, you’re right about the 216. ;)

What does your plot look like without this loop? I bet it’s still linearly increasing.
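
A rough way to see why: only a limited number of blocks can be resident on the GPU at once, so a large grid is worked through in successive batches. Here is a minimal sketch of that counting, assuming one thread per calculation, the 300-thread blocks and one active block per SM from the occupancy report, and 27 SMs. The function name `estimated_waves` is made up for illustration, and the model ignores memory bandwidth, latency hiding and everything else that affects real timings.

```python
import math

NUM_SMS = 27               # GTX 260 discussed in this thread
THREADS_PER_BLOCK = 300    # block size from the occupancy report
ACTIVE_BLOCKS_PER_SM = 1   # register-limited, per the occupancy report


def estimated_waves(n_threads):
    """Number of batches of resident blocks needed to cover n_threads."""
    blocks = math.ceil(n_threads / THREADS_PER_BLOCK)
    resident_blocks = NUM_SMS * ACTIVE_BLOCKS_PER_SM
    return math.ceil(blocks / resident_blocks)


for n in (10, 100, 1000, 10000, 20000, 100000, 1000000):
    print(n, estimated_waves(n))
```

Once N exceeds what fits on the chip at one time (27 x 300 = 8100 threads under these assumptions), the batch count grows roughly in proportion to N. Even within a single batch, the resident threads share the 216 scalar processors, so they are not all executing in the same clock cycle.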

The profiler only collects stats for one TPC (TPC = cluster of SMs). There are 3 SMs per TPC in the GT200 chip ( source: http://www.geeks3d.com/20100318/tips-what-…cluster-or-tpc/ ), so that’s why you need a factor of 27/3 = 9 to get to the expected number.
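
As a small sketch of that scaling: cta_launched counts thread blocks (CTAs), so the scaled value is an estimate of blocks, and it has to be multiplied by the block size to get threads. The counter value 2209 below is hypothetical, not taken from any profiler output in this thread, and `estimate_totals` is a made-up helper name. The result is only an estimate, since blocks are not guaranteed to be spread evenly across TPCs.

```python
SMS_TOTAL = 27     # GTX 260 discussed in this thread
SMS_PER_TPC = 3    # GT200: 3 SMs per TPC


def estimate_totals(cta_launched_one_tpc, threads_per_block):
    """Scale a per-TPC block count up to the whole chip."""
    scale = SMS_TOTAL // SMS_PER_TPC          # 27 / 3 = 9
    blocks = cta_launched_one_tpc * scale
    threads = blocks * threads_per_block
    return blocks, threads


# Hypothetical counter value, chosen for illustration only.
blocks, threads = estimate_totals(2209, 300)
print("estimated blocks:", blocks)
print("estimated threads:", threads)
```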

I think the compiler optimized something out… Check…