I have a kernel I’m testing that I’ve written in both OpenCL and CUDA. The first unexpected thing I noticed is that the kernel runs much slower under CUDA than under OpenCL. The difference is major (400 ms versus 190 ms) and lies entirely in the execution of the kernel itself. I’ve tested this on a 6800 and an 8800, using the 195.39 driver and SDK version 3.0 beta.
I use cutil_math.h for the vector maths. I also use OpenCL’s as_float and native_recip, but comparing against the OpenCL PTX, my simple CUDA replacements for these seem to generate sensible code, and they are not used often enough to account for a 2x difference.
I’m experimenting with different compiler settings to see whether that could be the issue. I tried different optimization settings (--optimize, -use_fast_math, and -maxrregcount) to no avail. However, one potential issue I see is that the CUDA compiler is generating version 1.4 PTX while OpenCL is generating version 1.5 PTX (though still targeting sm_11).
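In case it matters, this is roughly how I’m dumping and comparing the PTX on the CUDA side (file names are placeholders):

```shell
# Emit PTX for sm_11 with fast math, then check the ".version" directive
nvcc -ptx -arch=sm_11 -use_fast_math kernel.cu -o kernel.ptx
grep -m1 ".version" kernel.ptx    # shows ".version 1.4" for me
```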
Is there a way to generate version 1.5 PTX with NVCC? Any other ideas as to what the slowdown could be?