I suspect this is due to lower memory latency, which in turn comes from the higher memory frequency. My guess is that the latency dropped by a factor of 2-3.
Is this true?
Or perhaps you have a lot of atomic operations? According to the GK104 whitepaper there was a 3x improvement in atomic performance, which I guess would be closely related to the latency improvements you mention.
Well, if for example your application has high bandwidth utilization, we would consider it bandwidth-bound, i.e. the memory bandwidth of the card is the bottleneck of your application.
Now, if your problem has a very high arithmetic intensity (e.g. many additions and multiplications on the same data element => high FLOP/byte ratio), we might say that it is compute-bound. If that is the case, your performance increase would make sense, since the GTX 670 has roughly 2x the theoretical performance of the GTX 570.
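To make the bandwidth-bound vs compute-bound distinction concrete, here is a rough sketch of the roofline bound from the paper referenced below: attainable performance is the smaller of the compute peak and bandwidth times arithmetic intensity. The two spec dictionaries are placeholder numbers I made up for illustration, not measured or official figures for either card, and the function names are my own; plug in the real peak GFLOP/s and GB/s of your cards.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(compute roof, memory roof).

    intensity is arithmetic intensity in FLOP per byte moved to/from DRAM.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def bound_type(peak_gflops, bandwidth_gbs, intensity):
    """Classify a kernel by comparing its intensity to the ridge point."""
    ridge = peak_gflops / bandwidth_gbs  # FLOP/byte where the two roofs meet
    return "compute-bound" if intensity >= ridge else "bandwidth-bound"


# Placeholder specs (assumptions for illustration only, not real card data).
OLD_CARD = {"peak_gflops": 1400.0, "bandwidth_gbs": 150.0}
NEW_CARD = {"peak_gflops": 2500.0, "bandwidth_gbs": 190.0}


def expected_speedup(intensity):
    """Upper-bound speedup of NEW_CARD over OLD_CARD at a given intensity."""
    new = attainable_gflops(intensity=intensity, **NEW_CARD)
    old = attainable_gflops(intensity=intensity, **OLD_CARD)
    return new / old


if __name__ == "__main__":
    for intensity in (0.25, 4.0, 32.0):
        kind = bound_type(intensity=intensity, **OLD_CARD)
        print(f"{intensity:6.2f} FLOP/byte: {kind} on the old card, "
              f"speedup bound {expected_speedup(intensity):.2f}x")
```

With these placeholder numbers, a low-intensity kernel can only gain the bandwidth ratio, while a high-intensity kernel can gain the full compute ratio. So a speedup close to the ratio of theoretical FLOP/s would point toward the kernel being compute-bound (or toward some other effect, like the atomics mentioned above).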
Samuel Williams, Andrew Waterman, David Patterson
Roofline: an insightful visual performance model for multicore architectures
Communications of the ACM, Volume 52, Issue 4, April 2009, pp. 65-76