OpenCL runs faster than CUDA, plus PTX version weirdness

I have a kernel that I've written in both OpenCL and CUDA. The first unexpected thing I noticed is that the kernel runs much slower under CUDA than under OpenCL. The difference is major (400 ms compared to 190 ms), and it is entirely in the execution of the kernel itself. I've tested this on a 6800 and an 8800, with the 195.39 driver and the 3.0 beta SDK.
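For reference, the timings cover kernel execution only, measured with CUDA events on the CUDA side (and the equivalent profiling queries in CL). Roughly like this, where slowKernel and the buffers are placeholders for my actual code:

    #include <cstdio>
    #include <cuda_runtime.h>

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    slowKernel<<<grid, block>>>(d_arg0, d_arg5);   // the kernel under test
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                    // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);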

I use cutil_math.h for the vector maths. I also use the OpenCL as_float and native_recip built-ins, but looking at the OpenCL PTX, my simple CUDA replacements for these seem to generate sensible code, and they are not used often enough to account for a 2x difference.
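For what it's worth, the replacements are trivial one-liners along these lines (a sketch of the sort of thing I mean; the exact wrappers are in the attached zip):

    // Simple CUDA stand-ins for the OpenCL built-ins (sketch only):
    __device__ inline float as_float(unsigned int u) { return __int_as_float((int)u); }          // reinterpret bits
    __device__ inline unsigned int as_uint(float f)  { return (unsigned int)__float_as_int(f); } // reinterpret bits
    __device__ inline float native_recip(float x)    { return __fdividef(1.0f, x); }             // fast approximate 1/x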

I'm experimenting with different compiler settings to see if that could be the issue. I tried various optimization settings (--optimize, -use_fast_math, and -maxrregcount) to no avail. However, one potential issue I see is that the CUDA compiler is generating version 1.4 PTX while OpenCL is generating version 1.5 PTX (though still targeting sm_11).

Is there a way to generate version 1.5 PTX with NVCC? Any other ideas as to what the slowdown could be?

Thanks all.

I get an even bigger difference in performance on a GTX 285. My OpenCL kernel completes in 24 ms, my CUDA one in 75 ms. The PTX versions are the same as before (1.5 for OpenCL, 1.4 for CUDA), though the compute architecture is now sm_13 in both cases.

If anyone is interested, here is a cut-down version of my kernel that shows the same problem. On my GTX 285, running this kernel over a million data items (with a workgroup size of 64; bigger ones are slightly less efficient), the CL code takes 7.4 ms and the CUDA code 19.1 ms. The nvcc command line is:

"C:\CUDA\bin\nvcc.exe" -use_fast_math --optimize 4 -arch sm_13 -I"C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\inc" -ptx -o SlowOnCUDA.ptx SlowOnCUDA.cu
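The launch itself is the direct equivalent of the CL enqueue: 64 threads per block across the million items (kernel and buffer names here are placeholders):

    const int n = 1000000;                                // one million data items
    const int blockSize = 64;                             // matches the CL workgroup size
    const int gridSize = (n + blockSize - 1) / blockSize; // round up to cover all items
    slowKernel<<<gridSize, blockSize>>>(d_arg0, d_arg5);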

Here is the code that caused the discrepancy:

int offset = 0;

for (i = 0; i < 16; i++) {
	// t_bits is read from v2.w on iteration 0; bit 16 decides whether to go past i == 1
	if ((i == 1) && (!(t_bits & (0x1 << 16))))
		break;

	float4 v0 = ((__global float4 *) (arg0 + offset))[i + 0];
	float4 v1 = ((__global float4 *) (arg0 + offset))[i + 1];
	float4 v2 = ((__global float4 *) (arg0 + offset))[i + 2];

	if (i == 0)
		t_bits = as_uint(v2.w);

	float4 a = v1 - v0;
	float4 b = v2 - v0;

	const float t = dot(a, b);
	if (t < 1e-4f || t > rt)
		continue;

	rt = t;   // record the new best t and which iteration produced it
	q = i;
	break;
}

arg5[gid].a = make_float4(as_float(q), 0, 0, rt);   // write out the result

I've included the code, the CUDA wrapper needed to get it to compile (which does very little except include "cutil_math.h"), and the PTX for both CL and CUDA. One interesting detail: removing the offset, so the code reads as below, removes the discrepancy:

for (i = 0; i < 16; i++) {
	if ((i == 1) && (!(t_bits & (0x1 << 16))))
		break;

	float4 v0 = ((__global float4 *) (arg0))[i + 0];
	float4 v1 = ((__global float4 *) (arg0))[i + 1];
	float4 v2 = ((__global float4 *) (arg0))[i + 2];
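Given that, one variant that might be worth trying is hoisting the offset addition out of the loop, so the in-loop indexing is a plain float4 array access (an untested sketch; I don't know yet whether nvcc generates different PTX for it):

	__global float4 *base = (__global float4 *) (arg0 + offset);   // computed once, before the loop

	for (i = 0; i < 16; i++) {
		if ((i == 1) && (!(t_bits & (0x1 << 16))))
			break;

		float4 v0 = base[i + 0];
		float4 v1 = base[i + 1];
		float4 v2 = base[i + 2];
		/* ... rest of the loop body unchanged ... */
	}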

SlowOnCUDA.zip (4.74 KB)