Example code for CUDA 4.0

Hi! I have standard matrix multiplication code written in CUDA, and I need to use it with CUDA 4.0. How can I do that? What is different from earlier versions of CUDA?
Does anybody have example code for an earlier CUDA version and for CUDA 4.0, for comparison?
Many thanks!

First, install the CUDA 4.0 toolkit. Second, compile with the flag -arch=sm_20 to take advantage of all the features of compute capability 2.0 (this requires a GPU that supports compute capability 2.0).
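
For comparison, here is a minimal sketch of a naive matrix multiplication that should build with the CUDA 4.0 toolkit. The file name matmul.cu, the kernel name matMul, the matrix size and the fill values are all made up for illustration; this is not your original program. Ordinary kernel code like this generally looks the same as in earlier releases. One difference I am aware of is that CUDA 4.0 introduced cudaDeviceSynchronize() as the replacement for the older cudaThreadSynchronize(), so the sketch uses the newer call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Naive kernel: each thread computes one element of C = A * B.
// All matrices are n x n and stored in row-major order.
__global__ void matMul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

int main()
{
    const int n = 256;                        // illustrative size
    const size_t bytes = n * n * sizeof(float);

    // Host buffers: A is all 1.0, B is all 2.0, so every C element is 2 * n.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = 1.0f;
        hB[i] = 2.0f;
    }

    // Device buffers and host-to-device copies.
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // 16 x 16 threads per block; enough blocks to cover the whole matrix.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matMul<<<grid, block>>>(dA, dB, dC, n);

    // Wait for the kernel and report any launch or execution error.
    // Code written for earlier toolkits would call cudaThreadSynchronize() here.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * n);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    free(hA);
    free(hB);
    free(hC);
    return 0;
}
```

Assuming the file is saved as matmul.cu, it can be compiled with: nvcc -arch=sm_20 matmul.cu -o matmul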