No big performance gap between GeForce GTX 970 and GeForce GTX 750

I have a CUDA project developed with a GeForce GTX 750 in Visual Studio 2012.
I swapped the GeForce GTX 750 for a GeForce GTX 970 to increase performance.
NVIDIA's published performance figures say the GTX 970 is more than twice as fast as the 750.
But that is not what I see in my project.
What should I do in my project after changing the GPU?
Thanks in advance.

I assume you are already using CUDA 6.5 and the latest drivers? If so, other than compiling for the correct compute capability (GTX 750 Ti is compute capability 5.0, GTX 970 is compute capability 5.2) there is nothing you need to do.
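For example, if you build with nvcc from the command line (the Visual Studio CUDA build customization exposes the same "Code Generation" property), explicitly targeting both architectures might look like this, with kernel.cu as a hypothetical source file:

    nvcc -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -o myapp kernel.cu

The first -gencode keeps the GTX 750 covered; the second targets the GTX 970 natively.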

The specifications of the two cards that I managed to track down would seem to suggest about 2.6x performance difference:

GTX 750 Ti: memory bandwidth 86.4 GB/sec, 1306 single-precision GFLOPS
GTX 970: memory bandwidth 224 GB/sec, 3494 single-precision GFLOPS
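(That is 224 / 86.4 ≈ 2.6 for memory bandwidth and 3494 / 1306 ≈ 2.7 for single-precision throughput, so roughly the same factor whichever of the two resources a kernel is limited by.)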

However, nothing is known about your application and its bottlenecks, nor do you state the speedup you are seeing, and how it was measured. Your application’s performance may not be limited by GPU performance at all, but could be limited by the host system performance, file I/O, or PCIe throughput (to list some possible causes). Your GPU code may not provide sufficient parallelism to fully utilize the faster GPU.
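For example, one quick way to separate GPU kernel time from everything else is to bracket a kernel with CUDA events and compare that number against your end-to-end wall-clock time. A minimal sketch, using a hypothetical stand-in kernel rather than your real detector code:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for one kernel of the real application.
    __global__ void stageKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = 0;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time only the kernel: host work, file I/O, and PCIe transfers are
        // excluded, so a faster GPU should show up clearly in this number.
        cudaEventRecord(start, 0);
        stageKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }

If the kernel-only time scales by ~2.6x but the application does not, the remaining time is being spent outside the GPU kernels.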

I would suggest using the CUDA profiler to drill down on the exact performance differences for the CUDA kernels in your application.
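For instance, assuming your executable is called FaceDetect.exe (a hypothetical name), a first pass with nvprof, the command-line profiler shipped with the toolkit, summarizes the time spent in each kernel:

    nvprof FaceDetect.exe
    nvprof --print-gpu-trace FaceDetect.exe

Running the same command on both cards makes it easy to see which kernels actually got faster and which did not.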

My application is an AdaBoost face detector, and I saw a 1.5x performance boost, which is not the full boost I expected.
I think the most important settings are the number of threads per block and the number of registers per block.
I set both of them to 32 based on the occupancy figures from CUDA_Occupancy_Calculator.xls when using the GTX 750; occupancy was 50% for almost all functions, but the calculator said that setting would give the best performance.
Do these settings fit the GTX 970, too?

It is not really possible to give specific advice based on a vague description of an application.

Personally, the first thing I would check is whether the application is running a sufficiently large total number of threads. I do not have personal experience with these specific GPUs, but for the GTX 970 that is probably on the order of 15,000 concurrently running threads or more. My next step would be to double-check build settings (debug vs release, correct target architecture) and the correct configuration of all code conditionally compiled based on GPU architecture (for example, #if guards based on __CUDA_ARCH__).
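For illustration only, a hypothetical kernel with an architecture-dependent code path might be guarded like this; if such a guard uses an outdated threshold, the new GPU can silently keep running the old path:

    // Illustrative only: __CUDA_ARCH__ is defined only while device code is
    // being compiled, and takes the value of the target architecture.
    __global__ void classifyStage(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
    #if __CUDA_ARCH__ >= 500      // Maxwell path (sm_50 / sm_52)
        data[i] *= 2.0f;
    #else                         // fallback for older architectures
        data[i] = data[i] + data[i];
    #endif
    }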

If I understand your comment correctly, your app uses only 32 threads per thread block. That strikes me as very low. Typically thread counts of around 128 to 256 threads per thread block deliver optimal performance. This is just a rule of thumb, of course.
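As a quick experiment, you could simply try 128 or 256 threads per block, or ask the runtime for an occupancy-based suggestion via cudaOccupancyMaxPotentialBlockSize (added in CUDA 6.5, as far as I recall). A minimal sketch with a hypothetical kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel; the real detector kernels would go here.
    __global__ void scaleKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = 0;
        cudaMalloc(&d_data, n * sizeof(float));

        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for a block size that maximizes occupancy on the
        // GPU that is actually installed, taking the kernel's register and
        // shared memory usage into account.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
        printf("suggested block size: %d\n", blockSize);

        int gridSize = (n + blockSize - 1) / blockSize;   // round up to cover n elements
        scaleKernel<<<gridSize, blockSize>>>(d_data, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }

Note that the suggestion is based on occupancy alone; actual performance still needs to be measured.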

Beyond these simple sanity checks, the CUDA profiler is the best tool to tackle performance issues: it can highlight specific bottlenecks and may even provide relevant recommendations.

I tried to analyze my application with the Visual Profiler.
But I was unable to profile it; it returned the message “unknown error vprof return value 255”.
It was a pity.

This is impossible to diagnose remotely; it could be anything, starting with a faulty installation. You do not need the full Visual Profiler for some basic analysis. Try the command-line profiler, or even the minimal profiling capability built into the driver (set the environment variable CUDA_PROFILE=1 to turn it on; remember to unset that variable when you are done). The latter can give you the execution time and occupancy for each kernel invoked. The profiling data goes to a file called cuda_profile_*.log by default.
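On Windows, assuming the application is started from a command prompt and the executable is called FaceDetect.exe (a hypothetical name), that would look roughly like this:

    set CUDA_PROFILE=1
    FaceDetect.exe
    set CUDA_PROFILE=

The resulting cuda_profile_*.log should list, among other things, the gputime and occupancy for each kernel launch, which you can compare between the two cards.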