Doing precise timing tests on my system display device (or perils thereof)

I have a low end GPU which I bought for use as my system (Windows) display device. I’m using it for CUDA development work, for what I hope will be a high performing application. Before I invest in a more powerful GPU and look for development partners, I’d like to have some confidence that my code is capable of running at near 100% of the compute capacity of the ultimate target device.
If any of you think this is a fool’s errand, please tell me.
Otherwise, I want to get some reasonable timings of the kernel code on my present device. The kernel is predominantly compute bound and does little communication with the host. Ideally, the clock() interval should be exactly the same for the same code most of the time.
However, when the times get up to 200 or more cycles, the actual interval samples are so far spread out that I cannot tell how long the code actually takes when nothing is interfering with it.
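One common way to cope with that kind of spread is to time the same region many times and look at the minimum (and the median) rather than at any single sample. The following is only a minimal sketch under stated assumptions: `work_under_test`, `sample_cycles` and `min_cycles` are hypothetical names, the loop body is a placeholder for the real code, and a single thread is launched so that nothing else in the same kernel competes with it. The compiler may still move instructions across the `clock64()` reads, so the generated code is worth inspecting before trusting the numbers.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the code being timed; replace with the real work.
__device__ float work_under_test(float x)
{
    for (int i = 0; i < 64; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// A single thread times the same region n times and stores every interval.
__global__ void sample_cycles(long long *samples, int n, float *sink)
{
    float x = 1.0f;
    for (int i = 0; i < n; ++i) {
        long long t0 = clock64();
        x = work_under_test(x);
        long long t1 = clock64();
        samples[i] = t1 - t0;
    }
    *sink = x;  // store the result so the work is not optimized away
}

// Smallest interval seen: the best estimate of the undisturbed run time.
long long min_cycles(const std::vector<long long> &s)
{
    assert(!s.empty());
    return *std::min_element(s.begin(), s.end());
}

int main()
{
    const int n = 1000;
    long long *d_samples = nullptr;
    float *d_sink = nullptr;
    if (cudaMalloc(&d_samples, n * sizeof(long long)) != cudaSuccess ||
        cudaMalloc(&d_sink, sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    sample_cycles<<<1, 1>>>(d_samples, n, d_sink);

    // cudaMemcpy waits for the kernel and reports any launch/execution error.
    std::vector<long long> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_samples, n * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::sort(h.begin(), h.end());
    printf("min %lld  median %lld  max %lld cycles\n",
           min_cycles(h), h[n / 2], h.back());

    cudaFree(d_samples);
    cudaFree(d_sink);
    return 0;
}
```

If the minimum is stable from run to run while the maximum jumps around, the minimum is probably the closest thing to an interference-free figure that a display-driving device will give.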
My device has only a single Kepler SMX, and it is being used for all the system display functions. I would like to reserve half of the SMX for my kernel, which is 96 of the CUDA cores and two warp schedulers. If the second warp scheduler can be kept idle, and the first one running nothing but my kernel, then supposedly it should run at the same speed all (or almost all) of the time. Then my clock intervals would be a lot more useful to me.
My device does indeed support priorities, with levels 0 and -1. Do you think this would help my kernel get more processing cycles? And what functions could I call to control the warp schedulers? Is it possible to write a second kernel for the second scheduler that will not interfere with instruction dispatch by the first scheduler? If so, then I might assign both of these kernels to the same pair of schedulers and reserve them.

You don’t have the ability to individually control warp schedulers, or to split up an SMX the way you are suggesting. Priorities don’t allow you to arbitrate between CUDA and graphics.
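For reference, the priority levels mentioned in the question are stream priorities, and this is roughly how they are queried and used. It is a minimal sketch: `dummy_kernel` and `has_priority_levels` are hypothetical names used only for illustration. The priority only orders pending CUDA work between streams; it does nothing about the display traffic sharing the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, only here so there is something to launch.
__global__ void dummy_kernel(int *out) { *out = 1; }

// True when the device exposes more than one priority level.
bool has_priority_levels(int least, int greatest)
{
    return least != greatest;
}

int main()
{
    // Lower numbers mean higher priority, so "greatest" is e.g. -1.
    int least = 0, greatest = 0;
    cudaError_t err = cudaDeviceGetStreamPriorityRange(&least, &greatest);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("least priority %d, greatest priority %d, levels available: %s\n",
           least, greatest, has_priority_levels(least, greatest) ? "yes" : "no");

    // Create a stream at the highest priority the device reports.
    cudaStream_t hi;
    err = cudaStreamCreateWithPriority(&hi, cudaStreamNonBlocking, greatest);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));
    dummy_kernel<<<1, 1, 0, hi>>>(d_out);  // launch into the high-priority stream
    err = cudaStreamSynchronize(hi);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    cudaStreamDestroy(hi);
    return err == cudaSuccess ? 0 : 1;
}
```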

If you want to reserve a device for CUDA, I would switch to Linux. That is also my general recommendation for CUDA work.

I would recommend getting a more powerful device dedicated to CUDA as soon as possible, and devote your existing card to driving the display.
My experience when I started with CUDA a while ago was that the scaling did not work out as I anticipated, and I had to rework large parts of my code.
CUDA has become more forgiving of non-optimal code since then, and easier to use in general. So my experience may not directly translate to contemporary CUDA versions and GPUs, where even low-end devices are much more powerful now. But I believe it still makes a difference, and timing code on a WDDM device that is also driving the display is so hard I wouldn’t even try.

On Windows you’d want a card you can put into TCC mode (a Tesla, or apparently (some?) Titans, which you’ll now have to buy second-hand).
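To see which situation a given card is in, the runtime can report whether each device is using the TCC driver and whether a run time limit (the display watchdog) applies to kernels on it. A minimal sketch, assuming a reasonably recent CUDA toolkit; `driver_model_name` is a hypothetical helper added only for readable output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the TCC attribute into a readable label.
const char *driver_model_name(int tcc)
{
    return tcc ? "TCC" : "not TCC (e.g. WDDM, or a non-Windows driver)";
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        int tcc = 0, timeout = 0;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess ||
            cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, dev) != cudaSuccess ||
            cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev) != cudaSuccess) {
            fprintf(stderr, "could not query device %d\n", dev);
            continue;
        }
        // A kernel run time limit usually means the device also drives a display.
        printf("device %d (%s): driver model %s, kernel run time limit: %s\n",
               dev, prop.name, driver_model_name(tcc), timeout ? "yes" : "no");
    }
    return 0;
}
```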