Just to clarify:
The problem I’m having is twofold.
Firstly, gemv on the K20X is seriously underperforming in select cases.
Secondly, the K20X appears to “get tired” under heavy load, dropping from an initial 1.1 TFLOPS and settling at about 100 GFLOPS. In these scenarios, the “powermizer” rapidly (2-3 times/second) fluctuates between power-saving and performance mode. Further, while the card initially draws 200-220 W, its power usage gradually decays to about 100 W. Card temperature stabilises around 80-85 C.
Here are the relevant parts of ldd:
libcuda.so.1 => /usr/lib/nvidia-current/libcuda.so.1 (0x00007f02999b3000)
libcudart.so.5.0 => /usr/local/cuda/lib64/libcudart.so.5.0 (0x00007f0299759000)
libcurand.so.5.0 => /usr/local/cuda/lib64/libcurand.so.5.0 (0x00007f0297838000)
libcusparse.so.5.0 => /usr/local/cuda/lib64/libcusparse.so.5.0 (0x00007f028f805000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f028f5e7000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f028f2e4000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f028efe8000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f028edd1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f028ea12000)
libcufft.so.5.0 => /usr/local/cuda/lib64/libcufft.so.5.0 (0x00007f028ca0c000)
libcublas.so.5.0 => /usr/local/cuda/lib64/libcublas.so.5.0 (0x00007f0288fe7000)
libnpp.so.5.0 => /usr/local/cuda/lib64/libnpp.so.5.0 (0x00007f0283468000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f0283259000)
libGL.so.1 => /usr/lib/nvidia-current/libGL.so.1 (0x00007f0282f3b000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f0282d24000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0282b20000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f0282917000)
libnvidia-tls.so.304.43 => /usr/lib/nvidia-current/tls/libnvidia-tls.so.304.43 (0x00007f0282714000)
libnvidia-glcore.so.304.43 => /usr/lib/nvidia-current/libnvidia-glcore.so.304.43 (0x00007f028032a000)
libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007f027fff0000)
libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007f027fdde000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007f027fbbf000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007f027f9bb000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007f027f7b5000)
i.e. it is running CUDA 5.0.x.
I believe I have solved the first issue.
It seems the gemv code severely underperforms in this case: M=100, N=2^22, i.e. a huge inner dimension with a small(ish) outer dimension.
I coded my own kernel for the same calculation; these are the timings for 100 iterations:
gemv: 32.283 seconds (0.32283 sec/iter)
custom kernel: 2.435 seconds (0.02435 sec/iter)
Compared to a Quadro:
gemv: 25.534 seconds
custom kernel: 2.619 seconds
From this we can see that the library's gemv has not been properly optimized for this shape, and the K20X underperforming here is a symptom of a more general problem with gemv, not of the card itself.
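For reference, the usual approach for a gemv with a small row count and a huge inner dimension is to assign one thread block per row, have each thread stride across that row accumulating a partial dot product, and then combine the partials with a shared-memory tree reduction. The sketch below follows that pattern under stated assumptions (single precision, row-major A, power-of-two block size); it is illustrative, not the poster's actual kernel, and all names and launch parameters are hypothetical.

```cuda
// Sketch: y = A * x, with A stored row-major, M rows (small, e.g. 100)
// and N columns (huge, e.g. 2^22). One block per row.
__global__ void gemv_wide(const float *A, const float *x, float *y, int N)
{
    extern __shared__ float partial[];           // blockDim.x floats
    const float *row = A + (size_t)blockIdx.x * N;

    // Each thread strides across the row, accumulating a partial sum.
    float sum = 0.0f;
    for (int j = threadIdx.x; j < N; j += blockDim.x)
        sum += row[j] * x[j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Shared-memory tree reduction (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        y[blockIdx.x] = partial[0];
}

// Example launch for M rows with 256 threads per block:
// gemv_wide<<<M, 256, 256 * sizeof(float)>>>(d_A, d_x, d_y, N);
```

With this layout every block reads its row and the shared x vector in coalesced strides, which is what the huge-N case needs; the stock gemv presumably parallelises over rows instead, leaving most of the card idle when M is only 100.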
However, it remains that the card is basically “giving up” after lengthy calculations. Under any sustained load, after about 60 seconds the GPU can only draw about 100 W, as opposed to the 200 W it draws during early executions, and running times roughly double. At the moment I’m putting it down to a faulty card.