I run a kernel function on both a CPU (i7 @ 3 GHz) and a GPU (GeForce 9500 GT, 32 CUDA cores, 550 MHz clock). The speedup from using the GPU is at most 2. In other words, I do not see a big efficiency gain from the GPU. I would expect a larger speedup (>20). Can this happen (the kernel is very simple), or is it more likely that my kernel code is inefficient?
You should be happy with the speedup you are seeing. You are comparing a top-notch CPU with 96 single-precision GFLOP/s to an entry-level GPU with 89.6 single-precision GFLOP/s, and you still see a speedup. Are you sure your CPU code is well tuned?
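For reference, here is one way the two peak figures can be reproduced. This is only a sketch under assumptions that are not stated in the thread: a quad-core i7 issuing 8 single-precision FLOP per cycle per core (one 4-wide SSE add plus one 4-wide SSE multiply), and a 9500 GT whose 32 cores each do one multiply-add (2 FLOP) per cycle at a shader clock of about 1.4 GHz. The 550 MHz in the question would then be the core clock, not the shader clock.

```latex
\begin{align*}
P_{\text{CPU}} &= 4~\text{cores} \times 8~\tfrac{\text{FLOP}}{\text{cycle}} \times 3.0~\text{GHz} = 96~\text{GFLOP/s} \\
P_{\text{GPU}} &= 32~\text{cores} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 1.4~\text{GHz} = 89.6~\text{GFLOP/s}
\end{align*}
```

With the peaks this close, a speedup of 20 could not come from raw arithmetic throughput alone.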
OK, thank you for your answer. I want to ask something else. I have a for loop that launches a kernel. For the first 100-200 iterations the kernel appears to execute in 0.000002 sec (very quick); after that, each kernel run takes 0.2 sec (a plausible time).
Why could this happen?
Bear in mind that kernel invocations are asynchronous. You are probably not synchronizing before stopping the timer, so you are measuring only the launch overhead, not the kernel runtime. After a few hundred invocations, the driver is probably throttling the scheduling of new kernel invocations, so each launch has to wait for an earlier kernel to finish.
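To make the difference visible, here is a minimal sketch that times the same launch twice: once stopping the timer right after the launch, and once after synchronizing. The kernel `scale`, the array size, and the launch configuration are hypothetical stand-ins, since your actual kernel is not shown.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel being measured.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    typedef std::chrono::steady_clock Clock;
    const int n = 1 << 20;          // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization is not timed.
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaDeviceSynchronize();

    // (a) Timer stopped right after the launch: the call returns before
    //     the kernel finishes, so this measures launch overhead only.
    Clock::time_point t0 = Clock::now();
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    Clock::time_point t1 = Clock::now();
    cudaDeviceSynchronize();        // drain the queue before the next test

    // (b) Timer stopped after synchronizing: this includes the kernel runtime.
    Clock::time_point t2 = Clock::now();
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaDeviceSynchronize();        // block until the kernel has completed
    Clock::time_point t3 = Clock::now();

    printf("launch only:   %.6f s\n", std::chrono::duration<double>(t1 - t0).count());
    printf("launch + sync: %.6f s\n", std::chrono::duration<double>(t3 - t2).count());

    cudaFree(d_data);
    return 0;
}
```

On toolkits older than CUDA 4.0 the synchronization call is `cudaThreadSynchronize()` instead of `cudaDeviceSynchronize()`. If you keep the timer inside your loop, add the synchronization call before reading the stop time in every iteration.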