Report on OpenCL

Hi, I have been playing for a week with the OpenCL conformance candidate (on Windows with Visual Studio 2008), and here are my findings, questions, and suggestions about NVIDIA's OpenCL implementation, with comparisons to CUDA…
I am attaching it as a PDF since it is a fairly long report…
OpenCL_Report.pdf (39.7 KB)

I just read it and found that you have lots of findings. THANKS FOR SHARING! I haven't had time to play with the release since all my machines are 64-bit…

One thing that surprises me is that the error-correction support flag (CL_DEVICE_ERROR_CORRECTION_SUPPORT) is set to 1 on GT200. Does that mean the GT200 itself is capable of using ECC memory (although no such GeForce/Tesla card exists to date)? I know that ECC support is coming to CUDA, maybe in GT300, but I've never heard of it in current hardware (GT200).
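For reference, the value in question would be read with a plain clGetDeviceInfo call, something like the following sketch (it assumes a valid `cl_device_id` has already been obtained; error handling is trimmed to a single status check):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: query whether the device reports ECC support.
   `device` is assumed to come from a prior clGetDeviceIDs call. */
void print_ecc_support(cl_device_id device)
{
    cl_bool ecc = CL_FALSE;
    cl_int err = clGetDeviceInfo(device,
                                 CL_DEVICE_ERROR_CORRECTION_SUPPORT,
                                 sizeof(ecc), &ecc, NULL);
    if (err == CL_SUCCESS)
        printf("ECC supported: %s\n", ecc ? "yes" : "no");
}
```

Whether a reported CL_TRUE here reflects real ECC hardware on GT200 or just a driver default is exactly the question above.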

Also, though I haven't dug into the OpenCL spec, could you explain CLK_GLOBAL_MEM_FENCE and CLK_LOCAL_MEM_FENCE? You said __syncthreads() = barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE)… does that mean I can have a global __syncthreads() that synchronizes among all blocks instead of just one local block?

Thanks, Oscar.

I am grateful to all the people who find it useful! Thanks!

Regarding __syncthreads() = barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE), please don't get confused (as I also did at first, before reviewing the spec carefully)…

The barrier is always local to one work-group (block). CLK_GLOBAL_MEM_FENCE only adds the guarantee that all accesses to global memory by all threads in the block are properly visible after the barrier. This matches a feature of __syncthreads(), which guarantees this memory-consistency model for both local (shared) and global memory…
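To make the scope concrete, here is a minimal device-code sketch (a hypothetical staging kernel, not from the report): even with CLK_GLOBAL_MEM_FENCE, the barrier only synchronizes the threads of one work-group, never the whole grid.

```c
/* OpenCL kernel sketch. The barrier call below synchronizes ONLY this
   work-group; there is no grid-wide __syncthreads() equivalent. */
__kernel void scale(__global const float *in,
                    __global float *out,
                    __local  float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];            /* stage input in local (shared) memory */
    barrier(CLK_LOCAL_MEM_FENCE);   /* local writes now visible to the group */

    /* reading a neighbour's staged value is safe only because of the barrier */
    size_t next = (lid + 1) % get_local_size(0);
    out[gid] = tile[next] * 2.0f;
}
```

Combining both flags, barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE), gives the within-block visibility guarantees of __syncthreads(); it does not widen the set of threads that are synchronized.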

Another reason to post the report is to avoid opening a separate thread for every question in it… and to start the discussion, if possible, in this thread…

After writing the report, I think I can briefly summarize the most important worries/wishes I have at the moment:

  1. Unify the CUDA and OpenCL stacks at the lowest level… by that I mean:

A. (CODE) That PTX code for kernels generated by the CUDA and OpenCL compilers can be exchanged as needed. And also, why not expose a built-in compiler in CUDA like the one in OpenCL, not only for PTX code but also for C for CUDA…

In the report I show differences in the PTX code generated by these two backends… Are they going to be unified, or is this a long-term fact of life?…

B. (MEMORY) That device memory spaces are the same for CUDA and OpenCL… That would solve question #2 and also put to good use the CUDA 2.2 efforts at relaxing some limitations of pinned memory (making it portable across multiple GPUs, the zero-copy model, etc.)… It would allow using cudaMalloc, cudaMallocHost, cudaHostAlloc, cudaMemcpy (D2H, H2D), etc., obviating some less advanced/efficient (?) OpenCL memory APIs…

  2. Is pinned memory possible in OpenCL? I doubt it, seeing the bandwidth test: the "pinned" mode of that test allocates system memory with malloc, and the API doesn't seem to provide a method for allocating this type of system memory (à la CUDA mallochost)… A possible solution (only for NVIDIA cards) is in question #1?
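One candidate I have not verified: the spec's CL_MEM_ALLOC_HOST_PTR flag lets the driver choose the host allocation, so an implementation could back it with pinned pages. A speculative sketch (whether NVIDIA's driver actually pins this memory is exactly the open question; `ctx`, `queue`, and `size` are assumed valid):

```c
/* Speculative sketch: request host-accessible memory via
   CL_MEM_ALLOC_HOST_PTR, then map it to get a host pointer.
   Nothing in the spec guarantees the allocation is pinned. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

void *host_ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);
/* ... fill host_ptr directly ... */
clEnqueueUnmapMemObject(queue, buf, host_ptr, 0, NULL, NULL);
```

If the driver does pin such buffers, transfers from the mapped region should approach the "pinned" numbers of the CUDA bandwidth test; if not, they should match plain malloc.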

It would be good to know the opinions of someone at NVIDIA about these questions…

Thanks.

Oscar,

Thanks!

Did any of you try compiling your OpenCL code for both AMD and NVIDIA, and did you notice any performance difference?

Are the kernels performing as well as native kernels?

Are there OpenCL drivers yet for AMD?

They were boasting about it a while back, but I am not sure if they have shipped one. I have never worked with it before (but reading the spec and Brook, it looks interesting as well).