How many operations per cycle?

I have a question about the number of operations per cycle, which I need for a performance comparison.
I’m using a GT240 GPU (96 shaders, 1.34 GHz clock), which gives a theoretical peak of 257 GFLOP/s (fmadd).
The best result I can get (a dot product within an FIR filter, i.e. multiply/add) is 140 GFLOP/s. The kernel is not memory bound: everything is in shared memory, with no bank conflicts.
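
For reference, the arithmetic behind the 257 and 140 figures can be sketched as follows. This is only a back-of-the-envelope check, assuming that one fmadd counts as 2 floating-point operations per shader per cycle; the function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(shaders, clock_ghz, flops_per_shader_per_cycle=2):
    """Theoretical peak in GFLOP/s.

    Assumption: each shader retires one fused multiply-add per cycle,
    counted as 2 floating-point operations.
    """
    return shaders * clock_ghz * flops_per_shader_per_cycle


# Figures quoted in the post: 96 shaders at 1.34 GHz, 140 GFLOP/s measured.
peak = peak_gflops(96, 1.34)
measured = 140.0
efficiency = measured / peak

print(f"theoretical peak: {peak:.2f} GFLOP/s")
print(f"measured / peak:  {efficiency:.1%}")
```

So the measured FIR kernel reaches a bit more than half of the fmadd-only peak.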

My question is: which operations (load, store, access to shared/constant memory, add, …), and how many of them, run in parallel?

I’m coming from the Intel SIMD world, where normally 3-4 operations run in parallel.