Dense GEMV issues with K20 versus C2070

I have just installed the K20, only to find that performance actually drops for cuBLAS gemv calls.
I am trying to compute the matrix-vector product y = A*x, where dim(A) = (6e5, 100) and dim(x) = (100, 1).
Averaging over 100 executions, I have the following timings:
K20:
Average: 0.061 sec
SD: 0.010 sec

C2070:
Average: 0.029 sec
SD: 0.005 sec

I think there's something seriously amiss with gemv() on K20 units. I have also noticed large amounts of variability in the K20's performance.
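For reference, the call I am timing is along these lines (a minimal sketch, assuming single precision and device-resident data; not my exact harness):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int m = 600000, n = 100;            // dim(A) = (6e5, 100)
    float *dA, *dx, *dy;
    cudaMalloc(&dA, (size_t)m * n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, m * sizeof(float));       // contents don't matter for timing

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)             // average over 100 executions
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m,
                    dx, 1, &beta, dy, 1);     // y = A*x, column-major, lda = m
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average: %.3f sec\n", ms / 1000.0f / 100.0f);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}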

The size is too small and the time too short, by the way. Also, are you using the latest version of cuBLAS? It needs to be tweaked for the new architecture.

Can you expand on what needs to be tweaked? I don't believe the time is too small, by the way: over 100 executions, K20 timings ranged from an absolute minimum of 0.05 up to a peak of 0.1, whereas the C2070 ranged from 0.025 to 0.037. If anything, the execution times should be equal.

Regarding the variability in performance, I tested the following gemv calculations:
y = M(512i, 512i) * M(512i, 1), for i = 1, 2, 3, …, 10.
Throughput on the K20 ranged from 300 GFLOPS to 1.1 TFLOPS, whereas the C2070 held a constant 450 GFLOPS.

The K20 has a somewhat different architecture, as always. By the way, it may be an effect of batching small kernels; you would need to ask the NVIDIA developers. So, are you using the latest version of cuBLAS, one that is aware of the K20?

Yes, I have the latest version of cuBLAS.

Here are my benchmarks:
Current device: Tesla K20Xm
Begin benchmarking
Width: 512. GFLOPS: 619
Width: 1024. GFLOPS: 1042
Width: 1536. GFLOPS: 1080
Width: 2048. GFLOPS: 1118
Width: 2560. GFLOPS: 812
Width: 3072. GFLOPS: 649
Width: 3584. GFLOPS: 734
Width: 4096. GFLOPS: 598

Current device: Quadro 6000
Begin benchmarking
Width: 512. GFLOPS: 282
Width: 1024. GFLOPS: 307
Width: 1536. GFLOPS: 313
Width: 2048. GFLOPS: 315
Width: 2560. GFLOPS: 315
Width: 3072. GFLOPS: 316
Width: 3584. GFLOPS: 316
Width: 4096. GFLOPS: 316

Emphasis on variability.

Current device: Tesla K20Xm
Begin benchmarking
Gemv: 100x 100. MBPS: 2404
Gemv: 100x 1000. MBPS: 2913
Gemv: 100x 10000. MBPS: 3295
Gemv: 100x 100000. MBPS: 8102
Gemv: 1000x 100. MBPS: 24699
Gemv: 1000x 1000. MBPS: 20765

Current device: Quadro 6000
Begin benchmarking
Gemv: 100x 100. MBPS: 4653
Gemv: 100x 1000. MBPS: 4775
Gemv: 100x 10000. MBPS: 8535
Gemv: 100x 100000. MBPS: 15769
Gemv: 1000x 100. MBPS: 36502
Gemv: 1000x 1000. MBPS: 41052

What about the version of cuBLAS?

I'll check when I get home, but it is using the _v2 API and was downloaded from NVIDIA last week.

Are you sure that you are using toolkit 5.0?
Are you using GEMV, or GEMM with n=1, k=100? It looks like you are using the latter.

On my K20C, dgemv with m=600000, n=100:
^^^^ CUDA elapsed = 0.00358295 sec GFLOPS=33.6594
Using a K20X, you can expect roughly 6/5 of this performance for this memory-bound problem, because the bus width is 384 bits versus 320 bits.

On my C2070:
^^^^ CUDA elapsed = 0.00520205 sec GFLOPS=23.1832
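To put rough numbers on "memory bound": the dgemv has to stream all of A once, i.e. 8 bytes x 600000 x 100, about 480 MB, which over 0.00358 sec works out to roughly 134 GB/s, a good fraction of the K20c's ~208 GB/s peak. The arithmetic is only 2mn = 1.2e8 flops, so memory bandwidth, not compute, sets the runtime.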

There are 2 benchmarks there.

The first is a gemm benchmark, multiplying two square matrices together. The main point of that benchmark was to illustrate the high variability I am experiencing with the K20X. Overheating doesn't seem to be the problem: the card was at 40 °C (104 °F) when I ran this test, yet the "PowerMizer" feature was staying in the slowest state. I supplied a benchmark of the Quadro on the same computer to illustrate that it is much more stable on the same system.
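For reference, that benchmark is shaped roughly like this (an illustrative sketch, not the exact code; single precision, one timed call per size, buffers left uninitialized):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int maxN = 4096;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)maxN * maxN * sizeof(float));
    cudaMalloc(&dB, (size_t)maxN * maxN * sizeof(float));
    cudaMalloc(&dC, (size_t)maxN * maxN * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    printf("Begin benchmarking\n");
    for (int N = 512; N <= maxN; N += 512) {
        cudaEventRecord(start);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        // 2*N^3 floating-point operations per N x N gemm
        printf("Width: %d. GFLOPS: %.0f\n", N, 2.0 * N * N * N / (ms * 1e6));
    }

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}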

The second benchmark is a gemv benchmark with the dimensions listed. In every case above, the K20X was outperformed by the Quadro 6000.

Edit: Can you please supply a link to the toolkit that I should be using, just to double check we’re both running off the same code…

The cuBLAS library should have the extension 5.0.x.

If you are on Linux, can you make sure that you are not picking up an older library because of a wrong LD_LIBRARY_PATH? It is possible that you are using an older library whose sm_20 PTX is being JIT-compiled for sm_35.

Can you show the output of ldd?
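You can also check which version of the library actually gets loaded at runtime; something like this small sketch using cublasGetVersion:

#include <cublas_v2.h>
#include <cstdio>

int main()
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    int version = 0;
    cublasGetVersion(handle, &version);   // reports the loaded library, not the headers
    printf("cuBLAS version: %d\n", version);
    cublasDestroy(handle);
    return 0;
}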

Just to clarify:
The problem I’m having is twofold.
Firstly, gemv on the K20X is seriously underperforming in select cases.
Secondly, the K20X appears to "get tired" under heavy loads, dropping from an initial 1.1 TFLOPS and settling at about 100 GFLOPS. In these scenarios, the "PowerMizer" state is rapidly (2-3 times per second) fluctuating between power-saving and performance mode. Further, while the card initially draws 200-220 W, it gradually decays to about 100 W of power usage. Card temperature stabilises around 80-85 °C.

Here are the relevant parts of ldd:
libcuda.so.1 => /usr/lib/nvidia-current/libcuda.so.1 (0x00007f02999b3000)
libcudart.so.5.0 => /usr/local/cuda/lib64/libcudart.so.5.0 (0x00007f0299759000)
libcurand.so.5.0 => /usr/local/cuda/lib64/libcurand.so.5.0 (0x00007f0297838000)
libcusparse.so.5.0 => /usr/local/cuda/lib64/libcusparse.so.5.0 (0x00007f028f805000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f028f5e7000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f028f2e4000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f028efe8000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f028edd1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f028ea12000)
libcufft.so.5.0 => /usr/local/cuda/lib64/libcufft.so.5.0 (0x00007f028ca0c000)
libcublas.so.5.0 => /usr/local/cuda/lib64/libcublas.so.5.0 (0x00007f0288fe7000)
libnpp.so.5.0 => /usr/local/cuda/lib64/libnpp.so.5.0 (0x00007f0283468000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f0283259000)
libGL.so.1 => /usr/lib/nvidia-current/libGL.so.1 (0x00007f0282f3b000)
/lib64/ld-linux-x86-64.so.2 (0x00007f02a032a000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f0282d24000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0282b20000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f0282917000)
libnvidia-tls.so.304.43 => /usr/lib/nvidia-current/tls/libnvidia-tls.so.304.43 (0x00007f0282714000)
libnvidia-glcore.so.304.43 => /usr/lib/nvidia-current/libnvidia-glcore.so.304.43 (0x00007f028032a000)
libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007f027fff0000)
libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007f027fdde000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007f027fbbf000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007f027f9bb000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007f027f7b5000)
i.e., it is running 5.0.x.

I believe I have solved the first issue.
It seems the gemv code is severely lacking in this case:
M = 100, N = 2^22, i.e., a huge inner dimension with a small(ish) outer dimension.
I coded my own kernel for the same calculation, and these are the timings for 100 iterations.
K20X:
gemv = 32.283 seconds (0.32283 sec/iter)
custom kernel = 2.435 seconds (0.02435 sec/iter)

Compared to the Quadro:
gemv = 25.534 seconds
custom kernel = 2.619 seconds

From this we can see that this case has not been properly optimized, and the K20X's underperformance is a symptom of a more general problem with gemv here.
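My kernel is along these general lines (an illustrative sketch rather than the exact code; it assumes row-major storage so each of the 100 rows is contiguous, and a power-of-two block size):

#include <cuda_runtime.h>

// One block per row: each thread accumulates a partial dot product over the
// 2^22 columns, then the block reduces the partials in shared memory.
__global__ void tall_gemv(const float *A, const float *x, float *y, int N)
{
    extern __shared__ float partial[];
    const float *row = A + (size_t)blockIdx.x * N;
    float sum = 0.0f;
    // Consecutive threads read consecutive elements, so the loads coalesce.
    for (int j = threadIdx.x; j < N; j += blockDim.x)
        sum += row[j] * x[j];
    partial[threadIdx.x] = sum;
    __syncthreads();
    // Shared-memory tree reduction (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        y[blockIdx.x] = partial[0];
}

int main()
{
    const int M = 100, N = 1 << 22;
    float *dA, *dx, *dy;
    cudaMalloc(&dA, (size_t)M * N * sizeof(float));
    cudaMalloc(&dx, (size_t)N * sizeof(float));
    cudaMalloc(&dy, M * sizeof(float));
    const int threads = 256;
    for (int iter = 0; iter < 100; ++iter)
        tall_gemv<<<M, threads, threads * sizeof(float)>>>(dA, dx, dy, N);
    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}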

However, it remains that the card is basically "giving up" after lengthy calculations. If I put any load on the GPU for extended periods, it is only able to draw about 100 W after 60 seconds, as opposed to the 200 W it draws during early executions. Running times are essentially doubled. At the moment I'm putting it down to a faulty card.

You can run nvidia-smi in a loop, e.g. nvidia-smi -q -l, to monitor clocks, power, and temperature while running a CUDA app continuously. After you start the app, if you first see the temperature rise, with clocks and power consumption high, and later see GPU clocks and power consumption decreasing while the app continues to run, that is a possible indication that the GPU is slowing itself down due to excessive thermal load. The purpose of such a slowdown is to prevent permanent damage due to overheating.
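nvidia-smi sits on top of NVML, so if you would rather log these counters programmatically alongside your app, a rough sketch like this works too (error checking omitted; link against the NVML library that ships with the driver, e.g. -lnvidia-ml):

#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 60; ++i) {       // poll once a second for a minute
        unsigned int temp, mw, sm;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetPowerUsage(dev, &mw);               // milliwatts
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm); // MHz
        printf("temp=%u C  power=%.1f W  sm=%u MHz\n", temp, mw / 1000.0, sm);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}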

The K20X is a passively cooled device. Excessive thermal load can be caused by insufficient airflow in the enclosure, whose fans are responsible for generating the airflow necessary for cooling the GPU. I would suggest following up with your system vendor on this issue.

The separate issue of GEMV being slower than expected for certain matrix sizes is noted, but I do not have any insight into it at this time, as I don't normally use that function. Since you can demonstrate the issue across two different types of GPU, I would suggest filing a bug through the registered developer website, attaching a self-contained repro app.

Hi sBc-Random,

Without some logs it's hard to judge what is happening here. It would be great if you could attach a log from "nvidia-smi --query --loop 1". I'd look at the "Clocks Throttle Reasons" section:

Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active  <- this will become active if clocks are dropping because of power cap
        HW Slowdown             : Not Active  <- if this is Active then most probable cause is overheating
        Unknown                 : Not Active

It would also be good to double-check that the power cap is set to the default (I believe 225 W for the K20X):

Power Readings
        Power Management        : Supported
        Power Draw              : 14.59 W
        Power Limit             : 225.00 W  <- current
        Default Power Limit     : 225.00 W  <- default
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W  <- is default same as max?

Monitoring clocks throughout app execution is also a good idea.

OK, having run nvidia-smi (albeit in Windows):

Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active

    Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active

The memory clock is dropping from 2600 MHz to 324 MHz. Otherwise, everything else looks stable.

So the power cap is tripping…
Edit: It's tripping the power cap while drawing only 110 W. Must be a faulty card…

Can a mod please close this thread? I will repost with a more suitable subject name, as the discussion has diverged from the original subject.