GTX 1070 not running full clocks in CUDA

Edit: the answer is here: https://devtalk.nvidia.com/default/topic/892842/one-weird-trick-to-get-a-maxwell-v2-gpu-to-reach-its-max-memory-clock-/
I love how you have to go deep into the command line to make your product work out of the box. Thank you, NVIDIA.

Hi,

I am having trouble getting my card to run at full speed in CUDA.

If I run a GPU Stress Test, I see my core go to 1898MHz.
If I run a CUDA app, e.g. the matrixMulCUBLAS sample, my core runs at 1506MHz and I get low performance.
If I run both the GPU Stress Test AND matrixMulCUBLAS, my core goes to 1989MHz and I get faster CUDA performance.

Why do I have to run a fake application in the background to get CUDA to run fast? This is extremely frustrating!

I don't even want boost. I just want to hard-code a good MHz and fan speed and forget about it. I will check out nvidia-smi; maybe that will fix things.

nvidia-smi -q says:

   Process ID                  : 3948
        Type                    : C
        Name                    : C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Debug\matrixMulCUBLAS.exe

Clocks
    Graphics                    : 16 MHz
    SM                          : 16 MHz
    Memory                      : 3802 MHz
    Video                       : 544 MHz

Applications Clocks
    Graphics                    : 1506 MHz
    Memory                      : 4004 MHz

Default Applications Clocks
    Graphics                    : 1506 MHz
    Memory                      : 4004 MHz

Max Clocks
    Graphics                    : 1911 MHz
    SM                          : 1911 MHz
    Memory                      : 4004 MHz
    Video                       : 1708 MHz

Clock Policy
    Auto Boost                  : N/A
    Auto Boost Default          : N/A
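
For anyone comparing their own numbers: the gap between the application clock and the max clock in the output above can be pulled out with a few lines of Python. This is only a rough sketch. The REPORT text is copied from the post, and parse_clock_sections is a made-up helper name, not part of any NVIDIA tool.

```python
import re

# Excerpt of the `nvidia-smi -q` output pasted above (values copied from the post).
REPORT = """\
Applications Clocks
    Graphics                    : 1506 MHz
    Memory                      : 4004 MHz
Max Clocks
    Graphics                    : 1911 MHz
    SM                          : 1911 MHz
    Memory                      : 4004 MHz
    Video                       : 1708 MHz
"""

def parse_clock_sections(text):
    """Return {section: {clock_name: MHz}} for an `nvidia-smi -q` style excerpt."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = re.match(r"\s+(\w+)\s*:\s*(\d+) MHz", line)
        if match and current is not None:
            # Indented "Name : value MHz" line inside the current section.
            sections[current][match.group(1)] = int(match.group(2))
        elif not line.startswith(" "):
            # Unindented line starts a new section.
            current = line.strip()
            sections[current] = {}
    return sections

clocks = parse_clock_sections(REPORT)
app = clocks["Applications Clocks"]["Graphics"]
peak = clocks["Max Clocks"]["Graphics"]
print(f"application clock {app} MHz, max clock {peak} MHz, headroom {peak - app} MHz")
```

With the values above, the default application clock sits 405 MHz below the max graphics clock, which is consistent with the 1506MHz seen while the CUDA sample runs alone.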

Post your matrixMulCUBLAS.exe performance.
My output was:

./matrixMulCUBLAS 
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 1070" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 3120.26 GFlop/s, Time= 0.063 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
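
As a sanity check on the numbers in that output, the Size and Performance figures can be reproduced from the matrix shapes. A minimal sketch, assuming the usual count of 2 operations (one multiply, one add) per inner-product term; the variable names are mine, not the sample's.

```python
# Shapes from the sample output: A(640,480) times B(480,320) gives C(640,320).
M, K, N = 640, 480, 320

# Each of the M*N output elements needs K multiplies and K adds.
ops = 2 * M * N * K
print(ops)  # 196608000, matching "Size= 196608000 Ops"

def gflops(op_count, time_ms):
    """Convert an operation count and a time in milliseconds to GFlop/s."""
    return op_count / (time_ms * 1e-3) / 1e9

# The printed time is rounded to 0.063 msec, so this only approximates
# the reported 3120.26 GFlop/s.
estimate = gflops(ops, 0.063)
print(f"{estimate:.0f} GFlop/s")
```

The estimate lands within a fraction of a percent of the reported figure; the small difference comes from the rounded time.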

Windows 10

2482.76 GFlop/s out of the box,
3034.77 GFlop/s after setting application clocks to 4004,1911
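
For reference, application clocks are set with nvidia-smi -ac, which takes the memory clock first and the graphics clock second. Below is a small Python wrapper sketch around that call. The helper names are hypothetical, it defaults to a dry run, and whether a given card and driver accept the request is not guaranteed.

```python
import shutil
import subprocess

def application_clocks_command(memory_mhz, graphics_mhz):
    """Build the nvidia-smi call that sets application clocks (memory, then graphics)."""
    return ["nvidia-smi", "-ac", f"{memory_mhz},{graphics_mhz}"]

def set_application_clocks(memory_mhz, graphics_mhz, dry_run=True):
    """Print the command, and run it only when dry_run is False and nvidia-smi is found."""
    cmd = application_clocks_command(memory_mhz, graphics_mhz)
    if dry_run or shutil.which("nvidia-smi") is None:
        print("would run:", " ".join(cmd))
        return cmd
    # Usually needs an elevated (administrator/root) prompt.
    subprocess.run(cmd, check=True)
    return cmd

# Values from this thread: memory 4004 MHz, graphics 1911 MHz.
cmd = set_application_clocks(4004, 1911)
```

Running nvidia-smi -rac resets the application clocks to their defaults. On the two figures above, the change works out to roughly a 22% gain (3034.77 / 2482.76) for a roughly 27% higher graphics clock (1911 / 1506).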

How do I overclock this thing, btw!! And why is your Linux faster?