Running the exe directly vs. in the profiler with CUDA Toolkit 6.5: the profiler shows better performance

I have a simulation example and used CUDA Toolkit 6.5, knowing that it has optimized performance for mathematical functions such as sqrt and others. However, running the exe directly gives an execution time of 51 sec, whereas the profiler shows an execution time of 29 sec.

Why is there such a difference in version 6.5? The Visual Profiler is supposed to be slower than running the exe directly. Does this mean that, for better execution performance, running in the Visual Profiler is better than running the exe directly?