Separable convolution Sobel vs. my simple method: separable kernels not improving speed

I implemented the separable convolution method for a Sobel filter. The method I followed is this. Then I implemented a Sobel filter using just texture memory and registers. I am attaching the code and the CUDA Visual Profiler results for both methods. One major difference is that I used float in the separable convolution version and uchar4 in my own method. Can anyone please explain why the separable convolution version is taking so long? The attachment is a zip file with the two source files and their Visual Profiler analysis.
sobel.zip (5.19 KB)
sobel_separable_shared_tex.cu (5.96 KB)
sobel_tex_uchar4.cu (4.58 KB)
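
For reference, here is a minimal CPU sketch in C++ of the two formulations being compared. It is not the attached CUDA code. The function names, the clamp-to-edge border handling, and the float pixel type are assumptions made for illustration. It shows that the horizontal Sobel kernel factors into a row pass with [-1 0 1] followed by a column pass with [1 2 1], and that both routes give the same result. For a 3x3 kernel the separable route reads 3 + 3 values per pixel instead of 9, but it needs two passes and an intermediate buffer, so the arithmetic saving is small at this kernel size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp-to-edge pixel fetch (an assumption; the attached kernels may
// handle image borders differently).
static float pixelAt(const std::vector<float>& img, int w, int h, int x, int y) {
    x = x < 0 ? 0 : (x >= w ? w - 1 : x);
    y = y < 0 ? 0 : (y >= h ? h - 1 : y);
    return img[static_cast<std::size_t>(y) * w + x];
}

// Direct method: one pass with the full 3x3 horizontal Sobel kernel.
std::vector<float> sobelXDirect(const std::vector<float>& img, int w, int h) {
    static const float k[3][3] = {{-1.f, 0.f, 1.f},
                                  {-2.f, 0.f, 2.f},
                                  {-1.f, 0.f, 1.f}};
    std::vector<float> out(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += k[dy + 1][dx + 1] * pixelAt(img, w, h, x + dx, y + dy);
            out[static_cast<std::size_t>(y) * w + x] = sum;
        }
    return out;
}

// Separable method: row pass with [-1 0 1], then column pass with [1 2 1].
std::vector<float> sobelXSeparable(const std::vector<float>& img, int w, int h) {
    std::vector<float> tmp(static_cast<std::size_t>(w) * h);
    std::vector<float> out(static_cast<std::size_t>(w) * h);
    // Row pass: horizontal difference, written to an intermediate buffer.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            tmp[static_cast<std::size_t>(y) * w + x] =
                pixelAt(img, w, h, x + 1, y) - pixelAt(img, w, h, x - 1, y);
    // Column pass: vertical smoothing of the intermediate buffer.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[static_cast<std::size_t>(y) * w + x] =
                pixelAt(tmp, w, h, x, y - 1) +
                2.f * pixelAt(tmp, w, h, x, y) +
                pixelAt(tmp, w, h, x, y + 1);
    return out;
}
```

Calling sobelXDirect and sobelXSeparable on the same image should give identical output. On a GPU, the intermediate buffer in the separable version corresponds to an extra global-memory write and read between the two kernels, which is one cost the single-pass version does not pay.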

50 views and not a single reply? I seriously need some help.