Yes, the number of SIMD processors per multiprocessor is 8 on all current CUDA cards. The warp size is likewise fixed for all current cards at 32, although you can query the warp size at runtime. Future devices may have different values.
Each GPU has a number of multiprocessors, ranging up to 16 on the top models, and each multiprocessor currently contains 8 SIMD processors.
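If you want the actual values for the card you are running on, both can be queried through the runtime API. A minimal sketch using cudaGetDeviceProperties (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);
    // warpSize is 32 on current hardware; multiProcessorCount varies by card.
    printf("warp size: %d\n", prop.warpSize);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}
```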
The warp size of 32 is easily explained if you look at the details of the instruction pipeline.
The most common instructions take 4 clock cycles. Since each multiprocessor can issue 8 threads at a time (remember the 8 SIMD processors), it takes 4 steps to issue all the threads of a warp; at that point the pipeline is finished for the first 8 threads.
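Spelled out, that is exactly where the warp size comes from:

    8 SIMD processors × 4 cycles per instruction = 32 threads = 1 warp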
You can ensure scaling by splitting your problem into as many blocks as possible. As long as you do this and keep at least 64 threads per block (== 2 warps), your algorithm should be independent of the number of multiprocessors; see the sketch below.
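A minimal sketch of such a launch configuration (the kernel and sizes are made up for illustration; error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel, used only to show the launch shape.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 64 threads per block == 2 warps, the minimum suggested above.
    const int threadsPerBlock = 64;
    // As many blocks as the problem allows; the hardware then spreads
    // them over however many multiprocessors the card happens to have.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Because the grid contains far more blocks than any card has multiprocessors, a card with 16 multiprocessors simply works through them roughly twice as fast as one with 8.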
In my opinion, the number of SIMD processors per multiprocessor will not increase too much, because the SIMD constraints would then become a problem for scientific computing.