CUDA blurring filter running too slowly in gstdsexample using GpuMat!

The code is an exact copy of gstdsexample.cpp from the SDK. The only thing I have added is the GpuMat section, where I read the frame from the buffer (via EglImage) and blur it. I provided the code for this part in my original post. I implemented this part according to Nvidia’s suggestion here.

All I am asking is why the blurring is so slow! It is supposed to run on the GPU (since we are using a GpuMat), but it is 5 times slower than it should be!