When increasing threads linearly, throughput may not scale linearly at the end

In the same scenario, keeping everything else unchanged (for example, always using B blocks, where B equals the number of multiprocessors), I just double the threads per block each time, from 32 up to 1024.
At first the throughput doubles, until the thread count reaches 256. When I double the threads from 256 to 512, the throughput does increase, but it does not double; when I further double the threads from 512 to 1024, the throughput increases only a tiny bit and stays relatively flat.
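For context, here is a minimal sketch of how such a benchmark might look. The question does not show the actual kernel, so the dependent multiply-add loop in `busy_kernel`, the iteration count, and the timing scheme below are assumptions for illustration only:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a chain of dependent multiply-adds, so each
// instruction must wait for the result of the previous one.
__global__ void busy_kernel(float *out, int iters)
{
    float v = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;          // dependent arithmetic chain
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int B = prop.multiProcessorCount;        // one block per multiprocessor

    float *d_out;
    cudaMalloc(&d_out, B * 1024 * sizeof(float));

    // Double the threads per block each time, 32 ... 1024.
    for (int threads = 32; threads <= 1024; threads *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        busy_kernel<<<B, threads>>>(d_out, 1 << 20);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // Throughput here = dependent operations completed per second.
        double ops = (double)B * threads * (1 << 20);
        printf("%4d threads/block: %8.3f ms, %.2f Gop/s\n",
               threads, ms, ops / ms / 1e6);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_out);
    return 0;
}
```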

So my question is: why does doubling the threads not guarantee that the throughput doubles? What factor determines this relation? Are there any other determining parameters?

Let me guess: you have a compute capability 1.x device.

If every instruction depends on the result of the previous one, 1.x devices need 6 warps (192 threads) to fully hide instruction latency. So you basically have 192 slots that you can fill with threads, and throughput is proportional to the occupancy.
Once you exceed 192 threads, additional threads can only be used to hide memory latency; otherwise they just queue up waiting for execution.
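As a rough sketch of where the 192 comes from (assuming the 1.x execution model: 8 scalar cores per SM, so one warp instruction is issued over 4 cycles, and a register read-after-write latency of about 24 cycles):

$$
\text{warps to hide latency} \approx \frac{\text{arithmetic latency}}{\text{cycles to issue one warp}} = \frac{24}{4} = 6
\quad\Rightarrow\quad 6 \times 32 = 192 \text{ threads per SM.}
$$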

The principle is the same for 2.x devices, just the numbers are higher.

What I use is a GTX 480, which has compute capability 2.0. Do you know what the number is for that device? And what is this number called?

How do you define the occupancy here?