GTX 660 and Nano performance drop-off after sustained matrix multiplies

100,000 iterations per second is 10 microseconds per iteration, which is approximately the launch latency of a typical CUDA kernel. A matrix multiplication of any appreciable size will not finish in 10 microseconds, so that initial figure is suspect: it most likely reflects how fast kernels are being launched, not how fast they complete.

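My guess is that you are seeing the asynchronous launch queue go from “not full” to “full”. While the queue has room, each kernel launch returns to the host almost immediately, so your loop measures the launch rate. Once the queue fills, each new launch has to wait for an earlier kernel to finish, and the loop slows to the rate at which the GPU actually completes kernels. For some reason this topic has surfaced a number of times recently in other discussions here. This is just a guess, of course, as you have provided no code.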
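To make that guess concrete, here is a toy model of a bounded launch queue in plain Python. It is only a sketch under stated assumptions, not a description of the real driver: the 10 microsecond launch cost and 250 microsecond kernel duration are the figures from this thread, while `queue_depth=1000` and the function names are made up for illustration (the real queue depth is not something I am asserting).

```python
def simulate_launches(n_launches, launch_us=10.0, kernel_us=250.0, queue_depth=1000):
    """Return the host-side time (microseconds) at which each launch call returns.

    Toy model: kernels run one at a time, in order. queue_depth is a
    hypothetical limit on how many launched-but-unfinished kernels may exist.
    """
    finish = []    # GPU completion time of each kernel
    returned = []  # host time at which each launch call returns
    host = 0.0
    for i in range(n_launches):
        # Queue full: the host must wait until kernel (i - queue_depth) is done.
        if i >= queue_depth:
            host = max(host, finish[i - queue_depth])
        host += launch_us  # cost of issuing the launch itself
        # The kernel starts once it is launched and the previous one has finished.
        start = max(host, finish[-1] if finish else 0.0)
        finish.append(start + kernel_us)
        returned.append(host)
    return returned


def apparent_rate(returned, first, last):
    """Launches per second as seen by a host-side timer between two launches."""
    return (last - first) * 1e6 / (returned[last] - returned[first])


times = simulate_launches(5000)
print("early rate:", apparent_rate(times, 0, 500), "per second")
print("late rate: ", apparent_rate(times, 3000, 4000), "per second")
```

With these assumed numbers the early rate is set by the launch cost alone, and the late rate is set by the kernel duration, even though every simulated kernel takes the same 250 microseconds throughout.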

The “long term” rate is probably more reflective of what the GPU can actually sustain computationally: ~4000/sec for the 660 and ~1000/sec for the Nano, corresponding to actual kernel durations of about 250 µs and 1 ms respectively.
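The conversion between the rates quoted in this thread and per-iteration times is just a reciprocal. A minimal Python check (the helper name `per_iteration_us` is mine, purely for illustration):

```python
def per_iteration_us(rate_per_second):
    """Convert an iteration rate (per second) to microseconds per iteration."""
    return 1e6 / rate_per_second


# Rates taken from the discussion above.
for label, rate in [("initial burst", 100_000), ("GTX 660 sustained", 4_000), ("Nano sustained", 1_000)]:
    print(f"{label}: {rate}/sec is {per_iteration_us(rate):.0f} us per iteration")
```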

You may want to learn to use a GPU profiler. My guess is that a profiler would show every kernel taking on the order of 250 µs on the 660, from first to last. You’re just witnessing the effects of a finite asynchronous launch queue.