I implemented a SAXPY program (y = a*x + y) in both CUDA C and PGI Accelerator C. For CUDA, I fixed the number of threads per block at 256 and varied the vector size from 1024 to 16776960. I did the same for the PGI Accelerator implementation (schedule: parallel, vector(256), with the same vector sizes).
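For reference, here is a minimal plain-C sketch of the operation being benchmarked. It is not my actual CUDA or PGI Accelerator source; the function name `saxpy` and its signature are only illustrative. In the CUDA version each thread would typically handle one index `i`, and in the PGI Accelerator version the same loop is what the directives are applied to.

```c
#include <stddef.h>

/* Plain C reference for SAXPY: y[i] = a * x[i] + y[i].
 * Illustrative only: the name and signature are assumptions,
 * not the code that was actually benchmarked.
 * Each element costs one multiply and one add, i.e. 2 flops. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(n, a, x, y)` on the host gives a result that both GPU versions can be checked against.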
I compared the CUDA and PGI Accelerator results obtained on an NVIDIA C2050 GPU:
- GFlops of the whole program, i.e. kernel execution and data transfer: the PGI Accelerator values were always slightly below those of the CUDA implementation.
- GFlops of the kernel only: PGI Accelerator achieves more GFlops than CUDA up to a vector size of about 3 000 000. Beyond that, its GFlops fall below CUDA’s.
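To make the two metrics concrete, here is a small sketch of how such GFlops figures can be derived from a measured time. It assumes 2 floating-point operations per vector element (one multiply, one add); the helper name `saxpy_gflops` is hypothetical. The only difference between the two metrics is which time interval is passed in: kernel time alone, or kernel time plus data transfer.

```c
#include <stddef.h>

/* GFlops for a SAXPY over n elements that took `seconds` to run.
 * Assumption: 2 flops per element (multiply + add).
 * `saxpy_gflops` is a hypothetical helper, not part of any library. */
double saxpy_gflops(size_t n, double seconds)
{
    return (2.0 * (double)n) / seconds / 1e9;
}
```

For the same vector size, the whole-program figure is always lower than the kernel-only figure, because the transfer time is added to the denominator while the flop count stays the same.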
The second point is the one I don’t understand: why is PGI Accelerator faster up to a certain vector size?
My assumption/question: Is it possible that internal optimizations disable Fermi’s caching for small vector sizes?