I ran a test with N = 10, 100, 1000, 10000, and 20000 to verify the parallelism achieved with my kernels. Not much, it seems.
The memory transfer time is the same in every case because I transfer the whole array each time. The GPU time shown is the execution time of a single kernel only.
Are all the threads executed at the same time, up to at least 20,000 threads, on a GTX 260?

A typical occupancy analysis:

Kernel details : Grid size: 141 x 141, Block size: 300 x 1 x 1
Register Ratio = 0.6875 ( 11264 / 16384 ) [35 registers per thread]
Shared Memory Ratio = 0.03125 ( 512 / 16384 ) [32 bytes per Block]
Active Blocks per SM = 1 : 8
Active threads per SM = 300 : 1024
Occupancy = 0.3125 ( 10 / 32 )
Achieved occupancy = 0.3125 (on 27 SMs)
Occupancy limiting factor = Registers
Warning: Grid Size (19881) is not a multiple of available SMs (27).

The context is clear: the kernel is supposed to evaluate the same function N times.

The curve should therefore be quite flat, i.e. the same computation time for 10, 1,000, or even 20,000 calculations if everything runs in parallel. Or am I completely lost?

As far as I can tell, the number of threads launched is equal to N, obtained by multiplying the "cta launched" counter by the number of SMs (27 in my case).
What seems to be happening, with occupancy close to 1 in my cases, is that the number of branches and of divergent branches varies linearly with N (for N > 100000).
Although I have many divergent branches, no warp serialize count is reported.

Does anyone know what divergent branches and branch predication are all about?
Thanks

What does your plot look like without this loop? I bet it's still linearly increasing.

The profiler only collects statistics for one TPC (a TPC is a cluster of SMs). There are 3 SMs per TPC on the GT200 chip (source: http://www.geeks3d.com/20100318/tips-what-…cluster-or-tpc/ ), so you need a factor of 27/3 = 9, not 27, to get to the expected number.