Wish List for next OpenCL release

Emulation Debug mode for Visual Studio projects. I develop code on a system that does not have a CUDA-capable GPU and upload to a server with an FX 5800 to verify correct functionality and to time code. So it would be very helpful to have an EmuDebug mode (like in CUDA Visual Studio projects).

Performance equivalent to CUDA? >.<

At the moment there are huge, obvious performance differences between the CUDA and OpenCL SDK examples (e.g. bandwidthTest). On top of that, the driver timers measure one thing while CPU-side timers around OpenCL API calls show relatively high levels of unattributable overhead.

It’s completely unusable at the moment. We’re using the current release to implement our computational framework and crossing our fingers that the driver performance might catch up at some point.

Thanks for the feedback.

We’re aware there are some performance issues with the current release, but rest assured that our goal is for OpenCL to always match (or exceed) the performance of the equivalent CUDA program.

jcornwall,

I know it depends somewhat on the specific system and device, but I was wondering what kind of performance degradation you are seeing with the OpenCL release. When I wrap a shrDeltaT call around a clEnqueueNDRangeKernel call on a trivial kernel, I can't get below around 10 ms execution time. I am used to seeing sub-1 ms for this kernel using CUDA. I know performance will get better with subsequent releases, but I am wondering if the large overhead you are seeing is in this range.
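For concreteness, the pattern I'm measuring is roughly the sketch below (time_kernel_ms is just an illustrative name; shrDeltaT from the SDK's shrUtils plays the same role as the plain POSIX timer here). The clFinish() brackets matter, otherwise you only time the asynchronous enqueue:

```c
/* A minimal host-side timing sketch (POSIX timer; on Windows one
 * would use QueryPerformanceCounter instead). Assumes the queue,
 * kernel and work sizes are already set up. */
#include <CL/cl.h>
#include <sys/time.h>

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                      size_t global, size_t local)
{
    clFinish(queue);                 /* drain any pending work first */
    double t0 = now_ms();
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
    clFinish(queue);                 /* block until the kernel is done */
    return now_ms() - t0;
}
```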

I find it quite hard to measure these things precisely. Due to the asynchronous nature of the OpenCL command queue (which is, of course, a good thing), host-side timing tends to require strategically placed barriers and waits. These then impact the overall execution time themselves.
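For what it's worth, the event profiling API can sidestep most of that perturbation, since the timestamps are taken on the device rather than the host. Something like this sketch (assuming the queue was created with CL_QUEUE_PROFILING_ENABLE; profile_kernel_ms is just an illustrative name):

```c
#include <CL/cl.h>

/* Device-side timing via OpenCL event profiling. The queue must have
 * been created with CL_QUEUE_PROFILING_ENABLE. The timestamps come
 * from the device, so the only host synchronization needed is the
 * single wait on the event itself. */
double profile_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                         size_t global, size_t local)
{
    cl_event ev;
    cl_ulong start, end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global, &local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);
    return (double)(end - start) * 1e-6;  /* ns -> ms */
}
```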

For example, I think memory allocation is quite slow, but clCreateBuffer() returns instantly and the actual allocation only appears to occur at kernel execution time (for output buffers) or at copy commands (for input buffers). So allocation time becomes intertwined with kernel execution time.

In a short performance test I wrote I see about 10 ms for the clEnqueueNDRangeKernel, which is consistent with what you see. However, re-executing the kernel then appears to complete in 0.5 ms. (That could be allocation right there.) On a different system I see 15 ms execution which then, bizarrely, rises to 23 ms on re-execution. Deallocation appears to be conclusively very slow: I get 75-100 ms to free a 64 MB data set. CUDA once had that problem, but to NVIDIA's credit it was fixed quite promptly once I reported it.
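A warm-up launch makes the split visible. Using the host-timing helper sketched earlier in the thread:

```c
/* Sketch: separate the (suspected) lazy-allocation cost from the
 * steady-state kernel time by timing the first launch and a repeat
 * launch separately. time_kernel_ms() is the illustrative helper
 * from the earlier sketch. */
double first  = time_kernel_ms(queue, kernel, global, local);
double second = time_kernel_ms(queue, kernel, global, local);
printf("first launch:  %.2f ms (may include lazy allocation)\n", first);
printf("second launch: %.2f ms (steady state)\n", second);
```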

I wish for function pointers, virtual tables, or any mechanism other than (if/then/else) and (switch/case) for CALLING a function in OpenCL based on some index!!!

Why?

Unless you have a really (really, really) good reason for those, you shouldn’t use them. The way NVIDIA hardware works causes those constructs to perform very poorly. With a switch statement, every thread in a warp has to wait while each different path through the switch executes, effectively serializing the threads and wasting resources. You might as well compute each value and use a “select” construct to choose the one you want after they’ve been computed. This is detailed in the CUDA programming guide version 2.3, section 5.1.1.2.
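As a hypothetical illustration in OpenCL C (apply_op and the add/mul encoding are made up for the example): instead of branching on an index, compute both candidate values and let select() keep one; the compiler can turn that into predication instead of a divergent branch:

```c
/* Hypothetical illustration: avoid a divergent switch by computing
 * both candidate results and choosing one with select(). Every
 * work-item executes the same instructions; only the selection
 * differs, so the warp never diverges. */
__kernel void apply_op(__global const float *a,
                       __global const float *b,
                       __global float *out,
                       __global const int *op)  /* 0 = add, 1 = mul */
{
    int i = get_global_id(0);
    float sum  = a[i] + b[i];           /* both paths are computed... */
    float prod = a[i] * b[i];
    out[i] = select(sum, prod, op[i]);  /* ...then one is kept */
}
```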

My wish is also for CUDA-like performance…

So how would you approach a problem where you need to build a complex equation at run time (consisting of a set of simple kernels implemented in CUDA) and evaluate it for different parameters? In fact there would be no serious branch divergence, as most of the functions are just simple mathematical expressions, so the threads would mostly be following the same code path. But you cannot do this with static CUDA templates.
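The best workaround I can see under the current constraints is an interpreter-style kernel: encode the run-time expression as an opcode stream and switch on each opcode in a loop. Every work-item reads the same opcodes, so the branches are uniform across a warp and should not diverge. A hypothetical sketch (eval_expr and the opcode set are made up):

```c
/* Hypothetical sketch: evaluate a run-time-built expression without
 * function pointers. The host encodes the expression as opcodes;
 * every work-item walks the same opcode stream, so the switch is
 * uniform across each warp and causes no divergence. */
enum { OP_ADD = 0, OP_MUL = 1, OP_NEG = 2 };

__kernel void eval_expr(__global const int *ops, int n_ops,
                        __global const float *x,
                        __global const float *y,
                        __global float *out)
{
    int i = get_global_id(0);
    float acc = x[i];
    for (int k = 0; k < n_ops; ++k) {
        switch (ops[k]) {           /* same value for all work-items */
        case OP_ADD: acc += y[i]; break;
        case OP_MUL: acc *= y[i]; break;
        case OP_NEG: acc = -acc;  break;
        }
    }
    out[i] = acc;
}
```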

I wish the new release will be compatible with cuda driver v2.3 and higher instead of a single, beta version of cuda.

Actually, it might not be that simple. The thing that stands out to me is that you’d have trouble figuring out warp convergence points. Right now, they probably use a simple heuristic that works great for static code, but once you jump through a pointer, who knows when, or even if, you’ll return. So if that jump is conditional, where’s the point of reconvergence? In the worst case, you could end up with divergence that stays with you until the end of the kernel.

However, if the programmer can guarantee that the indirect jump is uniform across every warp, then this problem goes away.