OpenCL kernel vs CUDA kernel: why so different? I see very different performance for two almost identical kernels.

OpenCL kernel:

__kernel void GetPowerSpectrum_kernel_cl(__global float2* FreqData,
                                         __global float* PowerSpectrum) {
    uint i = get_global_id(0);
    float2 f1 = FreqData[i];
    float p = mad(f1.x, f1.x, f1.y * f1.y);
    PowerSpectrum[i] = p;
}

CUDA kernel:
__global__ void GetPowerSpectrum_kernel_cu(float2* FreqData, float* PowerSpectrum) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    float2 freqData = FreqData[i];
    PowerSpectrum[i] = freqData.x * freqData.x + freqData.y * freqData.y;
}
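
For what it's worth, the two kernels compute the same per-element value: OpenCL's built-in mad(a, b, c) is a * b + c, which on the host side corresponds to fmaf(). A minimal host-side C++ sketch (with a hypothetical sample value, just to show the two forms agree) looks like this:

#include <cassert>
#include <cmath>
#include <cstdio>

// Stand-in for the device float2 type: one complex frequency bin (re, im).
struct float2h { float x, y; };

int main() {
    float2h f1 = {3.0f, 4.0f};  // hypothetical sample bin

    // CUDA-kernel form: plain multiply and add.
    float p_plain = f1.x * f1.x + f1.y * f1.y;

    // OpenCL-kernel form: mad(f1.x, f1.x, f1.y*f1.y), written here as fmaf.
    float p_mad = fmaf(f1.x, f1.x, f1.y * f1.y);

    printf("%g %g\n", p_plain, p_mad);  // both print 25 for this input
    assert(p_plain == 25.0f && p_mad == 25.0f);
    return 0;
}

So the arithmetic itself cannot explain a large performance gap between the two kernels; the load pattern and compilation settings are the only real differences.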

The profiler reports zero uncoalesced loads for the OpenCL kernel but a huge number for the CUDA one. Same data set, same launch geometry...

I definitely did something wrong with the CUDA kernel, but what is it? Help, please.

EDIT: here is the NVCC command line:
echo "p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --compile -o "…\bin/AKv8/Win32/SSE3_OpenCL_NV/Intermediate/MB_CUDA_kernels.cu.obj" MB_CUDA_kernels.cu
"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --compile -o "…\bin/AKv8/Win32/SSE3_OpenCL_NV/Intermediate/MB_CUDA_kernels.cu.obj" "d:\R\SETI6\AKv8\client\MB_CUDA_kernels.cu"

I did some more investigations.

When I revert the project file to the MSVC compiler (I normally need ICC for this project) and compile the .cu file, I get this compile string in the output:

\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -use_fast_math -prec-div=false -prec-sqrt=false -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -DUSE_CUDA -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -I"…\bin/…/src" -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --ptxas-options=-v --compile -o "…\bin/AKv8/Win32/SSE3_CUDA/Intermediate/MB_CUDA_kernels.cu.obj" MB_CUDA_kernels.cu
1>MB_CUDA_kernels.cu

Then I reverted to ICC again and received:

"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -G0 -use_fast_math -prec-div=false -prec-sqrt=false -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "P:\bin\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -DUSE_CUDA -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -I"…\bin/…/src" -I"p:\bin\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32 --ptxas-options=-v --compile -o "…\bin/AKv8/Win32/SSE3_CUDA/Intermediate/MB_CUDA_kernels.cu.obj" MB_CUDA_kernels.cu

But I neither need nor enabled the extra options that appear only in the ICC build (-G0, -D_NEXUS_DEBUG, -g). I suspect these debug options cause the slowdown I described in the first post.
Looks like a bug? How can I disable these options when they are ALREADY DISABLED in the project options?