Neither is wrong. Since under CUDA multiple copies of an instruction are issued for a warp (which is 32 threads), if you count in the 575 MHz clock you need about 2 clocks to issue, say, an add instruction for an entire warp. Similarly, if you count in the 1.35 GHz clock, you need 4 clocks for the same purpose. In the end it’s all really the same, since the warp is CUDA’s compute granule.
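To make the arithmetic concrete, here is a small sketch using only the figures quoted in this thread. The value of 8 threads issued per 1.35 GHz clock is not stated above; it is an assumption inferred from "4 clocks for a 32-thread warp". The helper names are made up for illustration.

```python
# Figures quoted in the thread (GeForce 8800 GTX).
WARP_SIZE = 32            # threads per warp
SHADER_CLOCK_HZ = 1.35e9  # multiprocessor ("shader") clock
CORE_CLOCK_HZ = 575e6     # graphics core clock

# Assumption (inferred, not stated in the thread): 4 shader clocks per
# 32-thread warp means 8 threads are issued per shader clock.
THREADS_PER_SHADER_CLOCK = 8


def clocks_per_warp(warp_size: int, threads_per_clock: int) -> int:
    """Clocks needed to issue one instruction for a whole warp."""
    return -(-warp_size // threads_per_clock)  # ceiling division


def issue_time_ns(clocks: float, clock_hz: float) -> float:
    """Wall-clock time, in nanoseconds, taken by `clocks` cycles at `clock_hz`."""
    return clocks / clock_hz * 1e9


shader_clocks = clocks_per_warp(WARP_SIZE, THREADS_PER_SHADER_CLOCK)
warp_time_ns = issue_time_ns(shader_clocks, SHADER_CLOCK_HZ)

# The same time interval expressed in 575 MHz clocks.
core_clocks = warp_time_ns * 1e-9 * CORE_CLOCK_HZ

print(f"shader clocks per warp instruction: {shader_clocks}")
print(f"time per warp instruction: {warp_time_ns:.2f} ns")
print(f"equivalent 575 MHz clocks: {core_clocks:.2f}")
```

Under these assumptions, the 4 shader clocks correspond to a little under 2 of the 575 MHz clocks, because 1.35 GHz is somewhat more than twice 575 MHz. Either way the time per warp instruction is the same; only the unit you count in changes.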
The GPU has two clock “domains”. The 575 MHz clock is used for various graphics functions like triangle setup and rasterization. It isn’t used much by CUDA. The multiprocessors (“shaders”, in graphics) run at 1.35 GHz, and this is the clock domain that affects CUDA.
768 MB is the amount of DRAM on the 8800 GTX board. This off-chip memory can be used as “global” or “local” memory in CUDA, as well as texture memory. It is also used to store CUDA programs, context data used by the driver, the Windows / X desktop if it is active, and any data currently used by the graphics APIs. The 768 MB figure doesn’t include the on-chip shared memory.