You can leave out the cudaDeviceSynchronize() after the kernel if you want. On a Linux system, with a modern CUDA version, using a null kernel, you would find that each launch takes about 5 usec, and about 20 usec with the cudaDeviceSynchronize() added back in. If you then make the kernel launch more complex by adding bound textures, and time the increment due to each additional texture, I think you will find pretty much the timing stated in the blog article.
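For reference, here is a minimal sketch of the kind of measurement I am describing (kernel name, iteration count, and the use of std::chrono are my choices for illustration, not a prescribed methodology):

```
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main() {
    const int N = 10000;
    nullKernel<<<1, 1>>>();           // warm up: creates context, triggers JIT if needed
    cudaDeviceSynchronize();

    // Launch-only overhead: time N asynchronous launch calls back to back.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        nullKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();          // one final drain; negligible for a null kernel
    auto t1 = std::chrono::high_resolution_clock::now();
    double usecLaunch = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;

    // Launch plus synchronization: wait for each kernel to finish.
    t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        nullKernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    t1 = std::chrono::high_resolution_clock::now();
    double usecSync = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;

    printf("launch only: %.1f usec, launch + sync: %.1f usec\n", usecLaunch, usecSync);
    return 0;
}
```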
If you now repeat the experiment on Windows, with the WDDM driver model (which is what you are stuck with when using a consumer GPU), you will see something like this: even with a null kernel, timing is all over the place, from 10 usec to 80 usec for the launch alone, that is, without a call to cudaDeviceSynchronize(); the average across many launches is maybe 20 usec. Add back the call to cudaDeviceSynchronize() (still with a null kernel) and some kernel executions will clock in at significantly above 100 usec. You are observing the effects of (1) much higher overall overhead in the WDDM driver model, and (2) launch batching in the CUDA driver, which makes individual launch times uneven but reduces the overall and average WDDM overhead.
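You can expose the batching directly by timing each launch call separately instead of in aggregate; a sketch along these lines (again, kernel name and counts are illustrative):

```
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main() {
    nullKernel<<<1, 1>>>();               // warm up
    cudaDeviceSynchronize();
    for (int i = 0; i < 100; i++) {
        auto t0 = std::chrono::high_resolution_clock::now();
        nullKernel<<<1, 1>>>();           // asynchronous launch call only
        auto t1 = std::chrono::high_resolution_clock::now();
        printf("launch %3d: %6.1f usec\n", i,
               std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Under WDDM, most launch calls return quickly because the work merely goes into the driver's batch buffer; the occasional expensive launch is the one that triggers submission of the accumulated batch to the OS. As an aside, calling cudaStreamQuery(0) right after a launch is commonly cited as a way to force immediate submission under WDDM, at the cost of giving up the batching benefit.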
I have not tested the case of many bound textures under the WDDM driver model, but it stands to reason that the overhead per added texture will remain significantly higher than what is seen with the Linux driver, or for that matter the TCC driver on Windows (a non-graphics driver that can operate without much OS interference).
In a nutshell, if driver overhead is important for your use case, you would want to use either Linux, or a professional GPU with the TCC driver under Windows. I don't know about Mac OS X.
Note that the performance of the host system has some influence on the launch overhead, though in my experience a minor one. The above are timings I observed on modern, fast Ivy Bridge and Haswell based desktop machines with a PCIe 3 interface; you may see higher times on a slower laptop operating in battery mode.
BTW, the GTX 750 Ti is an sm_50 device; you would probably want to build native code for that architecture rather than sm_30, to avoid any possible overhead caused by JIT compilation (although in my understanding that overhead should be incurred at CUDA context initialization time).
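For example, when compiling with nvcc (file names are illustrative):

```
nvcc -gencode arch=compute_50,code=sm_50 -o app app.cu
```

Adding a second clause with code=compute_50 would also embed PTX, so the binary can still JIT for future architectures while running natively on sm_50.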