For sequential programming (with no data parallelism possible) I know it is better to use a CPU than a GPU. But I need the result from this sequential part passed to a kernel, and this has to be done iteratively many times. So I want to perform this sequential part on the GPU as well (in a single thread), just to minimize the memory transfers. My question is: is the performance of a single GPU thread running sequential code on par with a CPU?
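To make the pattern concrete, here is a minimal sketch of what I mean, assuming a single-block launch and hypothetical sequential/parallel steps (the function and variable names are made up):

    // One persistent kernel: thread 0 does the sequential step, then the whole
    // block consumes its result, so no host<->device copy is needed per iteration.
    __global__ void iterate(float *data, int n, int iterations)
    {
        __shared__ float seqResult;                 // result of the sequential part

        for (int it = 0; it < iterations; ++it) {
            if (threadIdx.x == 0) {
                // sequential part: only thread 0 executes it
                float acc = 0.0f;
                for (int i = 0; i < n; ++i)
                    acc += data[i];
                seqResult = acc;
            }
            __syncthreads();                        // make seqResult visible to all threads

            // parallel part: every thread uses the sequential result
            for (int i = threadIdx.x; i < n; i += blockDim.x)
                data[i] = data[i] * 0.5f + seqResult / n;
            __syncthreads();                        // finish before the next iteration
        }
    }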
If your code is purely sequential (i.e. it executes on just one scalar processor), the performance will fall far behind a modern processor core (e.g. an Intel Core microarchitecture core in a Core 2 Duo), by a factor of roughly 5x to 20x.
Worse, you will need to change the algorithm to use shared memory instead of global memory as much as possible, because memory accesses that are cached on a modern CPU are not cached in CUDA global memory and will hurt performance tremendously! A simple loop working in memory may be 100x slower on CUDA global memory than on an actual CPU core with cached memory!
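A rough illustration of that point (the names and sizes here are just for the example): a sequential loop that repeatedly walks the same small working set is far cheaper if the data is staged into shared memory once instead of being re-read from uncached global memory on every pass.

    #define TILE 256

    __global__ void seq_loop_shared(const float *g_in, float *g_out, int passes)
    {
        __shared__ float tile[TILE];

        // cooperative copy: all threads help stage the working set once
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            tile[i] = g_in[i];
        __syncthreads();

        if (threadIdx.x == 0) {
            float acc = 0.0f;
            for (int p = 0; p < passes; ++p)        // sequential passes now hit
                for (int i = 0; i < TILE; ++i)      // shared memory, not global memory
                    acc += tile[i];
            g_out[0] = acc;
        }
    }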
On the other side, you may use different techniques to improve the performance of sequential code, e.g. macro-threading to implement a programmable prefetcher / write-back cache system (see my cudachess blog) that WON'T IMPACT the execution time of your computations :-)
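As a hedged sketch of that macro-threading idea (not the exact scheme from the cudachess blog, and assuming g_in holds at least chunks * CHUNK elements): the other threads of the block act as a software prefetcher, copying the next chunk of global memory into a shared-memory double buffer while thread 0 runs the sequential computation on the current chunk.

    #define CHUNK 256

    __global__ void seq_with_prefetch(const float *g_in, float *g_out, int chunks)
    {
        __shared__ float buf[2][CHUNK];
        float acc = 0.0f;

        // preload the first chunk cooperatively
        for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
            buf[0][i] = g_in[i];
        __syncthreads();

        for (int c = 0; c < chunks; ++c) {
            int cur = c & 1, nxt = cur ^ 1;

            // "prefetcher" threads fetch chunk c+1 while thread 0 computes on chunk c
            if (threadIdx.x != 0 && c + 1 < chunks)
                for (int i = threadIdx.x - 1; i < CHUNK; i += blockDim.x - 1)
                    buf[nxt][i] = g_in[(c + 1) * CHUNK + i];

            if (threadIdx.x == 0)
                for (int i = 0; i < CHUNK; ++i)     // sequential work on the staged chunk
                    acc += buf[cur][i];

            __syncthreads();                        // swap buffers only when both sides are done
        }

        if (threadIdx.x == 0)
            g_out[0] = acc;
    }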
And better still, you may use micro-threading (segmenting pseudo-sequential code into threads of the same warp, a group of 32 threads) to accelerate things, in a way similar to the parallel execution inside a modern CPU. (Check my blog, it will come :-) )
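A small sketch of what such micro-threading can look like (one possible interpretation, assuming a single-warp launch such as <<<1, 32>>> and CUDA 9+ for __shfl_down_sync): a loop that looks sequential (a running sum) is split across the 32 lanes of one warp, each lane handling a strided slice, and the partial results are combined with warp shuffles, much like a superscalar CPU overlaps independent work.

    __global__ void warp_sum(const float *g_in, float *g_out, int n)
    {
        unsigned lane = threadIdx.x & 31;
        float acc = 0.0f;

        // each lane accumulates a strided slice of the "sequential" loop
        for (int i = lane; i < n; i += 32)
            acc += g_in[i];

        // combine the 32 partial sums inside the warp
        for (int offset = 16; offset > 0; offset >>= 1)
            acc += __shfl_down_sync(0xffffffffu, acc, offset);

        if (lane == 0)
            g_out[0] = acc;
    }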