OpenCL kernel vs CUDA kernel: why so different? I see very different performance for two almost identical kernels.

OpenCL kernel:

__kernel void GetPowerSpectrum_kernel_cl(__global float2* FreqData,
                                         __global float* PowerSpectrum) {
    uint i = get_global_id(0);
    float2 f1 = FreqData[i];
    float p = mad(f1.x, f1.x, f1.y * f1.y);
    PowerSpectrum[i] = p;
}

CUDA kernel:
__global__ void GetPowerSpectrum_kernel_cu(float2* FreqData, float* PowerSpectrum) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    float2 freqData = FreqData[i];
    PowerSpectrum[i] = freqData.x * freqData.x + freqData.y * freqData.y;
}
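
For what it's worth, the two kernels compute the same per-element value: OpenCL's built-in mad(a, b, c) is a * b + c, which on the host side corresponds to fmaf(). A minimal host-side C++ sketch (with a hypothetical sample value, just to show the two forms agree) looks like this:

#include <cassert>
#include <cmath>
#include <cstdio>

// Stand-in for the device float2 type: one complex frequency bin (re, im).
struct float2h { float x, y; };

int main() {
    float2h f1 = {3.0f, 4.0f};  // hypothetical sample bin

    // CUDA-kernel form: plain multiply and add.
    float p_plain = f1.x * f1.x + f1.y * f1.y;

    // OpenCL-kernel form: mad(f1.x, f1.x, f1.y*f1.y), written here as fmaf.
    float p_mad = fmaf(f1.x, f1.x, f1.y * f1.y);

    printf("%g %g\n", p_plain, p_mad);  // both print 25 for this input
    assert(p_plain == 25.0f && p_mad == 25.0f);
    return 0;
}

So the arithmetic itself cannot explain a large performance gap between the two kernels; the load pattern and compilation settings are the only real differences.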

The profiler reports zero uncoalesced loads for the OpenCL kernel but a huge number for the CUDA one. Same data set, same launch geometry...

I definitely did something wrong with the CUDA kernel, but what is it? Help, please.

EDIT: here is the NVCC command line:
echo "p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --compile -o "…\bin/AKv8/Win32/SSE3_OpenCL_NV/Intermediate/MB_CUDA_kernels.cu.obj" MB_CUDA_kernels.cu
"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --compile -o "…\bin/AKv8/Win32/SSE3_OpenCL_NV/Intermediate/MB_CUDA_kernels.cu.obj" "d:\R\SETI6\AKv8\client\MB_CUDA_kernels.cu"

I did some more investigations.

When I revert the project file to the MSVC compiler (I normally need ICC for this project) and compile the .cu file, I get this compile string in the output:

\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -use_fast_math -prec-div=false -prec-sqrt=false -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -DUSE_CUDA -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -I"…\bin/…/src" -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --ptxas-options=-v --compile -o "…\bin/AKv8/Win32/SSE3_CUDA/Intermediate/MB_CUDA_kernels.cu.obj" MB_CUDA_kernels.cu
1>MB_CUDA_kernels.cu

Then I reverted to ICC again and received:

"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -G0 -use_fast_math -prec-div=false -prec-sqrt=false -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -DUSE_CUDA -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -I"…\bin/…/src" -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --ptxas-options=-v --compile -o "…\bin/AKv8/Win32/SSE3_CUDA/Intermediate/MB_CUDA_kernels.cu.obj" MB_CUDA_kernels.cu

But I neither need nor enabled the extra options that appear only in the ICC build (-G0, -D_NEXUS_DEBUG, -g). I suspect these debug options cause the slowdown I described in the first post.
Looks like a bug? How can I disable these options when they are ALREADY DISABLED in the project options?