Unusual performance gap between OpenCL and CUDA

Hi there,
I asked this question on Stack Overflow about 4 days ago but haven't received a suitable answer. To save time, I'd like to refer you to the Stack Overflow question at:

Any help or ideas would be appreciated!

Thanks,