If your code is purely sequential (i.e. it executes on just one Scalar Processor), its performance will fall far behind a modern processor core (e.g. the Intel Core micro-architecture in a Core 2 Duo), by a factor of 5× to 20× slower.
Worse, you will need to change your algorithm to use Shared Memory instead of Global Memory as much as possible, because memory accesses that are cached on modern processors are NOT cached in CUDA Global Memory, and this hurts performance tremendously! A simple loop working in memory may be 100× slower in CUDA Global Memory than on an actual CPU core with CACHED memory!
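To make this concrete, here is a minimal sketch of the Shared Memory pattern, assuming a simple 3-point smoothing loop as the workload (the kernel name, `TILE` size, and filter are illustrative, not from the original post). Each input element is read several times by the computation, but Global Memory is touched only once per element; the repeated accesses hit fast on-chip Shared Memory instead:

```cuda
#define TILE 256  // assumed block size; launch with blockDim.x == TILE

__global__ void smooth_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];            // tile plus a 1-element halo on each side
    int g = blockIdx.x * TILE + threadIdx.x;    // global index of this thread's element

    // One coalesced Global Memory read per thread into Shared Memory.
    tile[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0) {                     // first thread also loads the two halo elements
        tile[0]        = (g > 0)        ? in[g - 1]    : 0.0f;
        tile[TILE + 1] = (g + TILE < n) ? in[g + TILE] : 0.0f;
    }
    __syncthreads();                            // tile is now complete for every thread

    // The three reads per element now hit Shared Memory, not uncached Global Memory.
    if (g < n)
        out[g] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                + tile[threadIdx.x + 2]) / 3.0f;
}
```

Without the staging step, each thread would issue three uncached Global Memory reads instead of one, which is exactly the kind of loop that ends up ~100× slower than a cached CPU core.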
On the other hand, you may use different techniques to improve the performance of sequential code, e.g. macro-threading to implement a programmable prefetcher/write-back cache system (see my cudachess blog) that WON'T impact the execution time of your computations :-)
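The cudachess blog describes the full system; as a rough sketch of the prefetcher idea alone (all names here are illustrative, assuming one block of `CHUNK` threads summing `nchunks` chunks), you can double-buffer data in Shared Memory so the Global Memory load of chunk i+1 overlaps the computation on chunk i:

```cuda
#define CHUNK 128  // assumed: launched with one block of CHUNK threads

__global__ void prefetch_sum(const float *in, float *out, int nchunks)
{
    __shared__ float buf[2][CHUNK];             // double buffer: compute on one half,
    float acc = 0.0f;                           // prefetch into the other

    buf[0][threadIdx.x] = in[threadIdx.x];      // prime the pipeline with chunk 0
    __syncthreads();

    for (int c = 0; c < nchunks; ++c) {
        int cur = c & 1, nxt = cur ^ 1;

        // Issue the load of the NEXT chunk first; the hardware can overlap
        // this Global Memory access with the arithmetic below, so the
        // prefetch does not add to the computation time.
        if (c + 1 < nchunks)
            buf[nxt][threadIdx.x] = in[(c + 1) * CHUNK + threadIdx.x];

        acc += buf[cur][threadIdx.x];           // compute on the current chunk
        __syncthreads();                        // next buffer is ready for everyone
    }
    out[threadIdx.x] = acc;
}
```

The key point is that the prefetch is issued before the computation that uses the current buffer, so the memory latency is hidden behind useful work.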
And better still, you may use micro-threading, i.e. segmenting pseudo-sequential code into threads of the same warp (a group of 32 threads), to accelerate things, in a way similar to the parallel execution units of a modern CPU. (check my blog, it will come :-) )
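Until that post arrives, here is a minimal sketch of what micro-threading can look like, assuming the "sequential" work is summing `n` values (the kernel name is illustrative, and it assumes a launch with a single warp of 32 threads): the loop is segmented across the 32 threads of the warp, and their partial results are combined with a small tree reduction.

```cuda
__global__ void warp_sum(const float *in, float *out, int n)
{
    __shared__ float partial[32];
    float acc = 0.0f;

    // Each of the warp's 32 threads takes a strided segment of the
    // originally sequential loop.
    for (int i = threadIdx.x; i < n; i += 32)
        acc += in[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction of the 32 partial sums: 16, 8, 4, 2, 1 active threads.
    for (int s = 16; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *out = partial[0];
}
```

The sequential O(n) loop becomes O(n/32) per thread plus a 5-step reduction, which is the warp-level analogue of spreading work across a CPU's parallel execution units.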