GTX 660 and Nano performance drop-off after sustained matrix multiplies

Robert_Crovella · July 11, 2022, 5:45pm

100,000 iterations per second is 10 microseconds per iteration. That is approximately the launch latency of a typical CUDA kernel. A matrix multiplication operation of any appreciable size is not going to be finished in 10 microseconds, so your data immediately becomes suspect in this respect.

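My guess is you are experiencing a transition from an asynchronous launch queue being “not full” to being “full”. While the queue has room, each kernel launch returns to the host almost immediately, so the loop appears to run at launch-latency speed; once the queue fills, each new launch has to wait for a previously queued kernel to finish. For some reason this topic has surfaced a number of times recently; here and here are recent discussions. This is just a guess, of course, as you have provided no code.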

The “long term” rate is probably more reflective of what the GPU can actually sustain from a computational perspective: ~4000/sec for the 660 and ~1000/sec for the Nano, corresponding to actual kernel durations of 250us and 1ms respectively.
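The conversions between rates and durations quoted so far are just reciprocals. A small Python sketch makes the arithmetic explicit (the helper name `us_per_iteration` is made up for illustration):

```python
def us_per_iteration(iterations_per_second):
    """Convert a throughput in iterations/second to microseconds per iteration."""
    return 1_000_000 / iterations_per_second

# Rates taken from the discussion above.
for label, rate in [("100,000/sec figure", 100_000),
                    ("GTX 660 long-term rate", 4_000),
                    ("Nano long-term rate", 1_000)]:
    print(f"{label}: {us_per_iteration(rate):.0f} us per iteration")
```

The first figure lands at launch-latency scale, while the other two are plausible durations for a real matrix multiply kernel.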

You may want to learn to use a GPU profiler. My guess is that if you used a profiler, you would observe that every kernel duration is on the order of 250us on the 660, from first to last. You’re just witnessing the effects of a non-infinite asynchronous launch queue.
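To illustrate the queue effect without a GPU, here is a toy Python model of a host loop launching kernels into a bounded asynchronous queue. It is only a sketch under stated assumptions: the 10us launch cost and 250us kernel duration come from the numbers above, while the queue depth of 1024 is a placeholder (the real depth is not given here and may differ by driver and GPU), and the function name `simulate_launches` is hypothetical.

```python
from collections import deque

def simulate_launches(n_launches, launch_us=10.0, kernel_us=250.0, queue_depth=1024):
    """Return the host-side times (in us) at which each launch call returns.

    queue_depth is an assumed value for illustration, not a documented one.
    """
    host_time = 0.0          # current time as seen by the host loop
    gpu_free_at = 0.0        # when the GPU finishes the last queued kernel
    pending = deque()        # completion times of kernels not yet finished
    returns = []
    for _ in range(n_launches):
        # Retire kernels that have already completed.
        while pending and pending[0] <= host_time:
            pending.popleft()
        # Queue full: the launch blocks until the oldest kernel completes.
        if len(pending) >= queue_depth:
            host_time = pending.popleft()
        host_time += launch_us                  # cost of issuing the launch
        start = max(host_time, gpu_free_at)     # kernels run back to back
        gpu_free_at = start + kernel_us
        pending.append(gpu_free_at)
        returns.append(host_time)
    return returns

def rate_per_second(t_first_us, t_last_us, count):
    """Apparent launches per second over a window of `count` intervals."""
    return count * 1_000_000 / (t_last_us - t_first_us)

times = simulate_launches(5000)
print("apparent early rate:", rate_per_second(times[0], times[100], 100))
print("apparent late rate: ", rate_per_second(times[-101], times[-1], 100))
```

In this model the host-side rate starts at the launch-latency limit and drops to the kernel-duration limit once the queue is full, even though every simulated kernel takes the same 250us from first to last. Timing the loop only after a device synchronization, or reading kernel durations from a profiler, avoids mistaking the early rate for real throughput.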
