Or perhaps you have a lot of atomic operations? According to the GK104 whitepaper there was a 3x improvement in atomic performance, which I guess would be closely related to the latency improvements that you mention.
Well, if for example your application has high bandwidth utilization, we would consider it to be bandwidth-bound, i.e. the memory bandwidth of the card is the bottleneck of your application.
Now, if your problem has a very high arithmetic intensity (e.g. many additions and multiplications on the same data element, giving a high FLOP/byte ratio), we might say that it is compute-bound. If this is the case, your performance increase would make sense, since the GTX 670 has roughly 2x the theoretical peak performance of the GTX 570.
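To make the bandwidth-bound vs compute-bound distinction a bit more concrete, here is a rough sketch in Python of the usual roofline-style estimate. The function names and the card numbers (1000 GFLOP/s, 100 GB/s) are made up for illustration and are not the specs of any real GPU; plug in the peak figures for your own card and the FLOP/byte ratio of your own kernel.

```python
def attainable_gflops(flops_per_byte, peak_gflops, peak_gb_per_s):
    """Roofline-style upper bound on performance in GFLOP/s.

    The kernel cannot run faster than the compute peak, and it cannot
    run faster than memory can feed it (intensity * bandwidth).
    """
    return min(peak_gflops, flops_per_byte * peak_gb_per_s)


def classify(flops_per_byte, peak_gflops, peak_gb_per_s):
    """Label a kernel as 'compute-bound' or 'bandwidth-bound'."""
    # Intensity at which the compute limit and the bandwidth limit meet.
    ridge = peak_gflops / peak_gb_per_s
    return "compute-bound" if flops_per_byte >= ridge else "bandwidth-bound"


if __name__ == "__main__":
    # Hypothetical card, NOT real specs: 1000 GFLOP/s peak, 100 GB/s bandwidth.
    peak_gflops, peak_gb_per_s = 1000.0, 100.0

    # Two hypothetical kernels: one with little math per byte, one with a lot.
    for intensity in (0.25, 40.0):
        label = classify(intensity, peak_gflops, peak_gb_per_s)
        bound = attainable_gflops(intensity, peak_gflops, peak_gb_per_s)
        print(f"{intensity} FLOP/byte: {label}, at most {bound} GFLOP/s")
```

This is only a first-order model: it ignores latency, caches, and atomics, so treat the result as a hint about which limit you are likely hitting, and check it against what the profiler reports.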