Hi all!
I’ve implemented a time-domain beamformer on a Tesla C870.
I took care to keep memory transactions coalesced and to minimize address and integer arithmetic, as far as I could.
The performance I obtained was around 3.2 Gflops.
Could anyone who has done something similar tell me whether this performance is typical for such an algorithm implemented in CUDA?