CUDA code Performance on different GPUs.

Hello Everyone,

I am doing a project in which I have to profile the same code on different GPUs (all NVIDIA GPUs). Initially the code was written for a GeForce 310, which has only 16 CUDA cores and can accommodate 512 threads per block, and its execution time was recorded. Then the same code, without any modification, was executed on a GTX 680, which has 1536 CUDA cores and can accommodate 1024 threads per block. When its execution time was measured, it was approximately the same as on the previous GPU, even though the CUDA core count and other specifications of the GTX 680 are far higher than those of the GeForce 310.

So now my questions are:

  1. What are the factors on which the performance of the GPU depends?
  2. Am I doing this right, or are modifications needed for the new GPU?
  3. The code performs two FFTs using the cuFFT library. Should cuFFT performance differ between the two GPUs, or will it be the same?

I am a newbie to the GPU computing world, so I don't have much idea whether these questions even make sense, but I am looking forward to learning.


  1. Launch enough thread blocks to saturate even future generations of GPUs, i.e. enough thread blocks to achieve full occupancy on all SMX units of current GPUs. If you only launch two thread blocks, that would explain why performance doesn't improve when running on a GTX 680.
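One common way to write a kernel that automatically fills whatever GPU it runs on is a grid-stride loop, sizing the grid from the device's multiprocessor count. A minimal sketch (the kernel name, array size, and blocks-per-SM factor are illustrative, not from the original post):

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: correct for any grid size, so the grid can be
// sized to fill the device instead of being hard-coded.
__global__ void scale(float *data, float factor, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= factor;
}

int main()
{
    int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int device, numSMs;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    // Launch several blocks per SM so the same binary keeps a 16-core
    // GeForce 310 and a 1536-core GTX 680 both fully occupied.
    scale<<<32 * numSMs, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

With a fixed, small grid (say two blocks), most of the GTX 680's SMX units would simply sit idle, which matches the symptom described above.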

  2. Can't tell without access to your source code, really.

  3. cuFFT should scale quite well with the number of cores on the GPU, assuming your problem size is big enough (a single 256-point FFT probably won't scale, but a batch of many such FFTs, or a much larger single FFT, say 64k samples, will).
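For the small-FFT case, batching is the usual fix: one plan executes many transforms in a single call, giving cuFFT enough parallel work to use all the cores. A minimal sketch, assuming complex-to-complex transforms (the sizes and batch count are placeholders):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int fftSize = 256;   // one small FFT alone is too little work
    const int batch   = 4096;  // many at once can fill a large GPU

    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * fftSize * batch);

    // A single batched plan: all 4096 transforms run in one call,
    // so the GPU's SMs are kept busy even for 256-point FFTs.
    cufftHandle plan;
    cufftPlan1d(&plan, fftSize, CUFFT_C2C, batch);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Looping over 4096 separate single-FFT calls instead would leave both GPUs underutilized and make their timings look deceptively similar.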