Temporal beamforming on a Tesla C870: is this performance reasonable?

Hi all!

I’ve coded a temporal (time-domain) beamformer on a Tesla C870.
I took care to keep global memory transactions coalesced and to keep address and integer arithmetic to a minimum, as far as I could.

The performance I obtained was around 3.2 GFLOPS.
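
To make a figure like this easier to compare, here is a minimal host-side sketch of how a GFLOPS number can be derived and set against a roofline-style limit. The function names and parameters are hypothetical, chosen only for illustration. It assumes the floating-point operations are counted by hand from the kernel, the time is the measured kernel time only, and the peak throughput and memory bandwidth are taken from the card's data sheet.

```cpp
#include <algorithm>

// Achieved throughput in GFLOP/s.
// flop_count: floating-point operations counted by hand for one kernel run
//             (assumption: integer and address arithmetic are not counted).
// seconds:    measured kernel execution time.
double achieved_gflops(double flop_count, double seconds) {
    return flop_count / seconds / 1e9;
}

// Roofline-style upper bound in GFLOP/s.
// A kernel is limited either by peak arithmetic throughput or by memory
// bandwidth multiplied by its arithmetic intensity (flops per byte moved
// to or from device memory), whichever is lower.
// peak_gflops and bandwidth_gb_per_s are data-sheet values supplied by
// the caller; nothing here is specific to one card.
double roofline_bound_gflops(double peak_gflops,
                             double bandwidth_gb_per_s,
                             double flops_per_byte) {
    return std::min(peak_gflops, bandwidth_gb_per_s * flops_per_byte);
}
```

As a rough illustration, assuming a delay-and-sum kernel that performs one addition per 4-byte sample loaded, the arithmetic intensity is about 0.25 flop/byte, so the bound is about a quarter of the memory bandwidth in GB/s, far below the arithmetic peak. For a kernel like that, the achieved memory bandwidth is probably a more telling measure than GFLOPS.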

Could anyone who has done something similar tell me whether this is a reasonable figure for this kind of algorithm implemented in CUDA?

Thanks