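You are probably making a mistake when timing your kernels. Kernel launches are asynchronous: the call returns immediately, before the kernel has finished running on the GPU. Call cudaThreadSynchronize() (deprecated in current CUDA versions in favor of cudaDeviceSynchronize()) before starting and before stopping the timer to get accurate results.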
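Replace clock() with a suitable high-resolution timing function (such as QueryPerformanceCounter() on Windows). With t1 taken just before launching kernel1, t2 between the two kernels, and t3 after kernel2, each reading taken right after a synchronize call, the execution time of kernel1 is (t2-t1) and that of kernel2 is (t3-t2).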
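As a rough sketch of that pattern: the kernel bodies, the KernelTimes struct, and the timeKernels() and elapsedMs() helpers below are hypothetical placeholders, not your code. It assumes std::chrono::steady_clock as a portable stand-in for QueryPerformanceCounter(), and uses cudaDeviceSynchronize() in place of cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; substitute your own kernel1 and kernel2.
__global__ void kernel1(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

__global__ void kernel2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

// Hypothetical result type: wall-clock time of each kernel in milliseconds.
struct KernelTimes {
    double kernel1Ms;
    double kernel2Ms;
};

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed between two host-side time points.
static double elapsedMs(Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

KernelTimes timeKernels(int n) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaDeviceSynchronize();            // drain pending GPU work before starting the timer
    Clock::time_point t1 = Clock::now();

    kernel1<<<blocks, threads>>>(d, n); // returns immediately (asynchronous launch)
    cudaDeviceSynchronize();            // wait until kernel1 has actually finished
    Clock::time_point t2 = Clock::now();

    kernel2<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();            // wait until kernel2 has actually finished
    Clock::time_point t3 = Clock::now();

    cudaFree(d);
    return KernelTimes{elapsedMs(t1, t2), elapsedMs(t2, t3)};
}
```

Calling timeKernels(1 << 20) from main() and printing the two fields gives the (t2-t1) and (t3-t2) times. Error checking on the CUDA calls is omitted here for brevity. If you only need GPU-side timings, CUDA events (cudaEventRecord() and cudaEventElapsedTime()) are an alternative to a host timer.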