Wish List for next OpenCL release

Emulation Debug mode for Visual Studio projects. I develop code on a system that does not have a CUDA-capable GPU and upload to a server with an FX 5800 to verify correct functionality and to time code. So it would be very helpful to have an EmuDebug mode (like in CUDA Visual Studio projects).

Performance equivalent to CUDA? >.<

At the moment there are huge, obvious performance differences between the CUDA and OpenCL SDK examples (e.g. bandwidthTest). On top of that, the driver timers measure one thing while CPU-side timers around OpenCL API calls show relatively high levels of unattributable overhead.

It’s completely unusable at the moment. We’re using the current release to implement our computational framework and crossing our fingers that the driver performance might catch up at some point.

Thanks for the feedback.

We’re aware there are some performance issues with the current release, but rest assured that our goal is for OpenCL to always match (or exceed) the performance of the equivalent CUDA program.

jcornwall,

I know it depends somewhat on the specific system and device, but I was wondering what kind of performance degradation you are seeing with the OpenCL release. When I wrap a shrDeltaT call around a clEnqueueNDRangeKernel call on a trivial kernel, I can't get below around 10 ms execution time. I am used to seeing sub-1 ms for this kernel using CUDA. I know performance will get better with subsequent releases, but I am wondering if the large overhead you are seeing is in this range.
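For concreteness, the pattern I'm measuring is roughly the sketch below (time_kernel_ms is just an illustrative name; shrDeltaT from the SDK's shrUtils plays the same role as the plain POSIX timer here). The clFinish() brackets matter, otherwise you only time the asynchronous enqueue:

```c
/* A minimal host-side timing sketch (POSIX timer; on Windows one
 * would use QueryPerformanceCounter instead). Assumes the queue,
 * kernel and work sizes are already set up. */
#include <CL/cl.h>
#include <sys/time.h>

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                      size_t global, size_t local)
{
    clFinish(queue);                 /* drain any pending work first */
    double t0 = now_ms();
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
    clFinish(queue);                 /* block until the kernel is done */
    return now_ms() - t0;
}
```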

I find it quite hard to measure these things precisely. Due to the asynchronous nature of the OpenCL command queue (which is, of course, a good thing), host-side timing tends to require strategically placed barriers and waits. These then impact the overall execution time themselves.
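For what it's worth, the event profiling API can sidestep most of that perturbation, since the timestamps are taken on the device rather than the host. Something like this sketch (assuming the queue was created with CL_QUEUE_PROFILING_ENABLE; profile_kernel_ms is just an illustrative name):

```c
#include <CL/cl.h>

/* Device-side timing via OpenCL event profiling. The queue must have
 * been created with CL_QUEUE_PROFILING_ENABLE. The timestamps come
 * from the device, so the only host synchronization needed is the
 * single wait on the event itself. */
double profile_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                         size_t global, size_t local)
{
    cl_event ev;
    cl_ulong start, end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global, &local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);
    return (double)(end - start) * 1e-6;  /* ns -> ms */
}
```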

For example, I think memory allocation is quite slow, but clCreateBuffer() returns instantly and the actual allocation only appears to occur at kernel execution time (for output buffers) or at copy commands (for input buffers). So allocation time becomes intertwined with kernel execution time.

In a short performance test I wrote I see about 10 ms for the clEnqueueNDRangeKernel, which is consistent with what you see. However, re-executing the kernel then appears to complete in 0.5 ms. (That could be allocation right there.) On a different system I see 15 ms execution which then, bizarrely, rises to 23 ms on re-execution. Deallocation appears to be conclusively very slow: I get 75-100 ms to free a 64 MB data set. CUDA once had that problem, but to NVIDIA's credit it was fixed quite promptly once I reported it.
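A warm-up launch makes the split visible. Using the host-timing helper sketched earlier in the thread:

```c
/* Sketch: separate the (suspected) lazy-allocation cost from the
 * steady-state kernel time by timing the first launch and a repeat
 * launch separately. time_kernel_ms() is the illustrative helper
 * from the earlier sketch. */
double first  = time_kernel_ms(queue, kernel, global, local);
double second = time_kernel_ms(queue, kernel, global, local);
printf("first launch:  %.2f ms (may include lazy allocation)\n", first);
printf("second launch: %.2f ms (steady state)\n", second);
```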

I wish for function pointers, virtual tables, or any mechanism other than (if/then/else) and (switch/case) for CALLING a function in OpenCL based on some index!!!

Why?

Unless you have a really (really, really) good reason for those, you shouldn’t use them. The way NVIDIA hardware works causes those constructs to perform very poorly. With a switch statement, every thread in a warp has to wait while each different path through the switch executes, effectively serializing the threads and wasting resources. You might as well compute each value and use a “select” construct to choose the one you want after they’ve been computed. This is detailed in the CUDA programming guide version 2.3, section 5.1.1.2.
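As a hypothetical illustration in OpenCL C (apply_op and the add/mul encoding are made up for the example): instead of branching on an index, compute both candidate values and let select() keep one; the compiler can turn that into predication instead of a divergent branch:

```c
/* Hypothetical illustration: avoid a divergent switch by computing
 * both candidate results and choosing one with select(). Every
 * work-item executes the same instructions; only the selection
 * differs, so the warp never diverges. */
__kernel void apply_op(__global const float *a,
                       __global const float *b,
                       __global float *out,
                       __global const int *op)  /* 0 = add, 1 = mul */
{
    int i = get_global_id(0);
    float sum  = a[i] + b[i];           /* both paths are computed... */
    float prod = a[i] * b[i];
    out[i] = select(sum, prod, op[i]);  /* ...then one is kept */
}
```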

My wish is also for CUDA-like performance…

So how would you approach a problem where you need to build a complex equation at run time (consisting of a set of simple kernels implemented in CUDA) and evaluate it for different parameters? In fact there would be no serious branch divergence, as most of the functions are just simple mathematical expressions, so the threads would mostly be following the same code path. But you cannot do this with static CUDA templates.
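The best workaround I can see under the current constraints is an interpreter-style kernel: encode the run-time expression as an opcode stream and switch on each opcode in a loop. Every work-item reads the same opcodes, so the branches are uniform across a warp and should not diverge. A hypothetical sketch (eval_expr and the opcode set are made up):

```c
/* Hypothetical sketch: evaluate a run-time-built expression without
 * function pointers. The host encodes the expression as opcodes;
 * every work-item walks the same opcode stream, so the switch is
 * uniform across each warp and causes no divergence. */
enum { OP_ADD = 0, OP_MUL = 1, OP_NEG = 2 };

__kernel void eval_expr(__global const int *ops, int n_ops,
                        __global const float *x,
                        __global const float *y,
                        __global float *out)
{
    int i = get_global_id(0);
    float acc = x[i];
    for (int k = 0; k < n_ops; ++k) {
        switch (ops[k]) {           /* same value for all work-items */
        case OP_ADD: acc += y[i]; break;
        case OP_MUL: acc *= y[i]; break;
        case OP_NEG: acc = -acc;  break;
        }
    }
    out[i] = acc;
}
```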

I wish the new release will be compatible with cuda driver v2.3 and higher instead of a single, beta version of cuda.

Actually, it might not be that simple. The thing that stands out to me is that you’d have trouble figuring out warp convergence points. Right now, they probably use a simple heuristic that works great for static code, but once you jump through a pointer, who knows when, or even if, you’ll return. So if that jump is conditional, where’s the point of reconvergence? In the worst case, you could end up with divergence that stays with you until the end of the kernel.

However, if the programmer can guarantee that the indirect jump is uniform across every warp, then this problem goes away.