Comparison with FBO and nvcc option

Dear experts

I have ported an optimization algorithm to CUDA. It was originally written in GLSL using frame-buffer objects (FBOs).

The CUDA version is 15% slower than the original GLSL/FBO version.

I have not used shared memory yet, because I first wanted to measure how a direct conversion performs.
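To illustrate what "direct conversion" means here, the following is a minimal sketch and not the actual code: one thread per output pixel, reading its inputs straight from global memory, much like one fragment-shader invocation per pixel. The kernel name `directPass`, the helper `runDirectPass`, the four-neighbour average, and the 16x16 block size are all assumptions made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one fragment-shader pass: each thread computes one
// output pixel from its four neighbours, reading global memory directly
// (no shared memory, no texture fetches).
__global__ void directPass(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard threads in partial blocks

    // Clamp neighbour coordinates at the image border.
    int xl = max(x - 1, 0), xr = min(x + 1, width - 1);
    int yu = max(y - 1, 0), yd = min(y + 1, height - 1);

    out[y * width + x] = 0.25f * (in[y * width + xl] + in[y * width + xr] +
                                  in[yu * width + x] + in[yd * width + x]);
}

// Runs one pass over an image filled with 1.0f and returns the centre pixel.
float runDirectPass(int width, int height)
{
    const size_t count = size_t(width) * height;
    const size_t bytes = count * sizeof(float);
    std::vector<float> host(count, 1.0f);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                       // assumed block size
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    directPass<<<grid, block>>>(dIn, dOut, width, height);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return host[size_t(height / 2) * width + width / 2];
}

int main()
{
    printf("centre value: %f\n", runDirectPass(512, 512));
    return 0;
}
```

In the GLSL version the neighbour reads would go through the texture units, while this sketch issues plain global loads, which may be one place where the two versions differ.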

I expected the performance to be closer to the original, and I intended to use this version as the starting point for further memory-bandwidth optimization.

Is this additional overhead the price of CUDA's greater flexibility and generality compared with FBOs?

Also, please let me know whether nvcc has any options for maximum speed (like the optimization settings of the Visual Studio compiler).
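For context, here is a rough sketch of how a build and a timing run could look. The flags in the comments (`-O3`, `-arch`, `-use_fast_math`, `-Xptxas -v`) are existing nvcc options, but whether any of them closes the gap in this case is an open question. `sm_XX` is a placeholder for the target GPU, and the `scale` kernel and `timeKernelMs` helper are hypothetical names used only to show the event-based timing.

```cuda
// Example build lines (sm_XX stands for the target GPU's compute capability):
//   nvcc -O3 -arch=sm_XX -o bench bench.cu
//   nvcc -O3 -arch=sm_XX -use_fast_math -o bench bench.cu  // faster but less accurate math
//   nvcc -O3 -arch=sm_XX -Xptxas -v -o bench bench.cu      // print register usage per kernel
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; the real kernel would go here.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the average time of one kernel launch in milliseconds,
// measured on the GPU with CUDA events.
float timeKernelMs(int n, int repeats)
{
    float* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemset(dData, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time setup cost is not included in the timing.
    scale<<<blocks, threads>>>(dData, n, 1.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        scale<<<blocks, threads>>>(dData, n, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait until all launches finish

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    return totalMs / repeats;
}

int main()
{
    printf("average kernel time: %f ms\n", timeKernelMs(1 << 20, 100));
    return 0;
}
```

Timing with events measures only the GPU work between the two records, so host-side overhead such as buffer setup is excluded when comparing against the GLSL version.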

Thanks a lot in advance!