I have tried to match the speed of the Sobel filter example from the CUDA SDK and came to the conclusion that the 40,000–70,000 fps it reports on a typical Windows Vista computer with, say, a GeForce 9800 GT is an inaccurate measurement caused by the CUDA timing code. Please prove me wrong. The way the code measures time is completely phony!
The code runs the Sobel kernel inside an OpenGL display() callback and brackets it with start/stop timer calls. This gives wrong timing, because any __global__ kernel launch returns control to the host asynchronously, before a substantial part of the kernel has even executed. Measuring from the kernel launch to the moment control comes back to the host therefore has nothing to do with the actual execution time on the device.
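Here is a minimal sketch of the pitfall. SobelKernel, the launch configuration, and the buffer names are my placeholders, not the SDK's actual identifiers; the point is what each timestamp actually measures:

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Placeholder kernel; the real Sobel computation is elided.
__global__ void SobelKernel(const unsigned char *in, unsigned char *out,
                            int w, int h)
{
    /* ... Sobel filter ... */
}

void timeOneLaunch(const unsigned char *d_in, unsigned char *d_out,
                   int w, int h)
{
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    clock_t t0 = clock();
    SobelKernel<<<grid, block>>>(d_in, d_out, w, h);
    clock_t t1 = clock();     // launch has returned; kernel may still be running!

    cudaThreadSynchronize();  // block until the kernel really finishes
    clock_t t2 = clock();

    printf("launch-to-return: %ld ticks, launch-to-completion: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t0));
}
```

Timing only t1 - t0, as the sample effectively does, measures launch overhead, not the kernel.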
In fact, requesting full completion of the code on the GPU with cudaThreadSynchronize() before stopping the timer yields about 60 fps in Release mode, three orders of magnitude slower than what the CUDA sample displays. Now, this figure is also imprecise, since it underestimates the effect of thread sliding, but the difference is tremendous.
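For what it's worth, CUDA events give a device-side measurement that avoids host timer noise altogether; a sketch, reusing the placeholder names from above:

```cuda
void timeWithEvents(const unsigned char *d_in, unsigned char *d_out,
                    int w, int h)
{
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    SobelKernel<<<grid, block>>>(d_in, d_out, w, h);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the stop event is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // milliseconds between events
    printf("kernel time: %.3f ms (%.1f fps)\n", ms, 1000.0f / ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```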
The right way to measure time is probably to launch, say, 1000 kernels, then call cudaThreadSynchronize(), and average the elapsed time over the launches. Using a video buffer as output may also impose restrictions of its own, so it is better to dump the result into regular global memory.
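A sketch of that approach, again with my placeholder names; d_in and d_out here are ordinary cudaMalloc'd buffers, not a GL-mapped video buffer, so no interop cost gets mixed into the number:

```cuda
void timeAveraged(const unsigned char *d_in, unsigned char *d_out,
                  int w, int h)
{
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    const int N = 1000;

    SobelKernel<<<grid, block>>>(d_in, d_out, w, h);  // warm-up launch
    cudaThreadSynchronize();

    clock_t t0 = clock();
    for (int i = 0; i < N; ++i)
        SobelKernel<<<grid, block>>>(d_in, d_out, w, h);
    cudaThreadSynchronize();   // wait for all N kernels to finish
    clock_t t1 = clock();

    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("average: %.3f ms per kernel, %.1f fps\n",
           1000.0 * seconds / N, N / seconds);
}
```

Averaging over many launches also amortizes the per-launch overhead, so the result is much closer to the real per-frame throughput of the device.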