Separable-convolution Sobel vs. my simple method: separable kernels not improving speed

I implemented the separable-convolution method for a Sobel filter. The method followed is this. I then implemented a Sobel filter using only texture memory and registers. I am attaching the code and the CUDA Visual Profiler results for both methods. One major difference is that I used float in the separable convolution and uchar4 in my own method. Can anyone please explain why the separable convolution is taking so long? The attachment is a zip file with the two codes and their Visual Profiler analysis.
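For anyone reading without the attachment, here is a minimal CPU sketch in plain C++ of the two approaches being compared. It is not the attached CUDA code. The function names, the clamp-to-edge border handling, and the use of float in both versions are assumptions made only for illustration. It shows that the horizontal Sobel response can be computed either in one 3x3 pass or as a row pass with [-1 0 1] followed by a column pass with [1 2 1], with identical results.

```cpp
#include <vector>

// Clamp-to-edge pixel fetch. This border policy is an assumption;
// the attached kernels may handle borders differently.
static float at(const std::vector<float>& img, int w, int h, int x, int y) {
    x = x < 0 ? 0 : (x >= w ? w - 1 : x);
    y = y < 0 ? 0 : (y >= h ? h - 1 : y);
    return img[y * w + x];
}

// Direct method: one pass, full 3x3 Sobel kernel (horizontal gradient).
std::vector<float> sobel_x_direct(const std::vector<float>& img, int w, int h) {
    static const float k[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    std::vector<float> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += k[dy + 1][dx + 1] * at(img, w, h, x + dx, y + dy);
            out[y * w + x] = sum;
        }
    return out;
}

// Separable method: row pass with [-1 0 1], then column pass with [1 2 1].
// Note the intermediate buffer that is written once and read back.
std::vector<float> sobel_x_separable(const std::vector<float>& img, int w, int h) {
    std::vector<float> tmp(w * h), out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            tmp[y * w + x] = at(img, w, h, x + 1, y) - at(img, w, h, x - 1, y);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = at(tmp, w, h, x, y - 1)
                           + 2.0f * at(tmp, w, h, x, y)
                           + at(tmp, w, h, x, y + 1);
    return out;
}
```

In this sketch the separable version replaces nine multiply-adds per pixel with about six, but it needs two passes over the image and an intermediate buffer. For a kernel as small as 3x3 that extra memory traffic may be part of what I am seeing, though I have not confirmed this.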

50 views and not a single reply? I seriously need some help.