Broken OpenCL Runtime in Latest Drivers - Not Spec Conformant?

Hello,

This issue applies to the 346.* and 352.* drivers on Linux x86_64. You can see two of my discussions on other websites here (the first one is more detailed):

  • https://www.khronos.org/message_boards/showthread.php/11619-NVIDIA-Multi-Device-Command-Queue-Concurrency-Issue
  • http://stackoverflow.com/questions/31758669/opencl-multi-device-commandqueue-concurrency-issue-nvidia
    Essentially, my software, which according to page 25 of the OpenCL 1.1 spec absolutely should execute in parallel w.r.t. clEnqueue* calls on different (device-exclusive) command queues:

    "It is possible to associate multiple queues with a single context. These queues run concurrently and independently with no explicit mechanisms within OpenCL to synchronize between them."

    is not doing that at all, regardless of whether I specify in-order or out-of-order execution (that shouldn’t matter anyway). I did confirm that CUDA applications can run concurrently (see the first post, near the bottom of the first page).
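    For concreteness, the pattern in question looks roughly like the sketch below — not my actual code (see the repo for that), just a minimal illustration: one context spanning both GPUs, one in-order queue per device, and non-blocking writes enqueued on each. Error checking is omitted, and it assumes an OpenCL 1.1 SDK with two GPU devices on one platform:

```c
/* Minimal sketch (not the repo code): one context over two GPUs, one
 * command queue per device, non-blocking writes on both queues.  Per
 * OpenCL 1.1 p. 25 the two queues run concurrently and independently;
 * flush both before waiting so neither submission is delayed.
 * Assumes an OpenCL SDK and two GPU devices; error checks omitted. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   /* 16M floats per device, ~64 MB */

int main(void) {
    cl_platform_id plat;
    cl_device_id dev[2];
    cl_uint ndev = 0;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 2, dev, &ndev);
    if (ndev < 2) { fprintf(stderr, "need 2 GPUs\n"); return 1; }

    cl_context ctx = clCreateContext(NULL, 2, dev, NULL, NULL, NULL);
    cl_command_queue q[2];
    cl_mem buf[2];
    float *host[2];

    for (int i = 0; i < 2; ++i) {
        q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);
        buf[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                N * sizeof(float), NULL, NULL);
        host[i] = calloc(N, sizeof(float));
        /* CL_FALSE => non-blocking: the call returns immediately, so
         * the two H->D transfers are free to overlap. */
        clEnqueueWriteBuffer(q[i], buf[i], CL_FALSE, 0,
                             N * sizeof(float), host[i], 0, NULL, NULL);
    }
    clFlush(q[0]);  clFlush(q[1]);   /* submit both batches first */
    clFinish(q[0]); clFinish(q[1]);  /* then wait on each */
    return 0;
}
```

    In a profiler I would expect the two write transfers (and, with kernels added, the kernel launches) to overlap across devices; what I actually observe is full serialization.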

    I would appreciate some NVIDIA support on this, as it seems that a fundamental component of the OpenCL spec is not working as advertised (if the problem is not on my end, this is a serious issue). The drivers are advertised as supporting OpenCL 1.1 (and now 1.2), and they should do what they claim to…

    Additionally, if someone here is running two or more AMD cards, I would greatly appreciate it if you could confirm that my code runs concurrently, with either textual or graphical profiler output. You can clone my repo here:

    https://github.com/stevenovakov/learnOpenCL

    Please check out the simple_events branch only (it reflects how I would write a typical OpenCL application).

    You may want to file a bug at developer.nvidia.com

    I’m not an OpenCL expert.

    It’s evident that the second profiler “output” depicted in your Stack Overflow posting is manufactured from the first.

    It’s possible that the H->D overlap and the D->H overlap you are depicting are not possible due to your specific motherboard configuration. If the GPUs share a host PCIe link (i.e., there is a PCIe switch on your motherboard), then transfers to one device will serialize with transfers to the other device. That’s a somewhat less common motherboard configuration (though certainly possible), and since I don’t know what your motherboard configuration is, I’ll leave it at that.

    I don’t put a lot of stock in the above comment, because even if that were the case, the kernel call on the first GPU should overlap with the copy operations to the second GPU, and I don’t see that. But I don’t know all the nuances of buffer pinning in OpenCL that would be needed to observe that.
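    For reference, a common OpenCL idiom for pinned host memory (roughly analogous to CUDA’s cudaHostAlloc) looks something like the sketch below. This is just an illustration, not a statement about what your code does: ctx, q, devbuf, and src are assumed to exist, error checking is omitted, and whether a given runtime actually pins CL_MEM_ALLOC_HOST_PTR allocations is an implementation detail:

```c
/* Sketch of a common OpenCL pinned-host-memory idiom (roughly analogous
 * to CUDA's cudaHostAlloc).  Whether a runtime actually pins
 * CL_MEM_ALLOC_HOST_PTR allocations is an implementation detail.
 * Error checking omitted; ctx, q, devbuf are created elsewhere. */
#include <CL/cl.h>
#include <string.h>

void staged_write(cl_context ctx, cl_command_queue q, cl_mem devbuf,
                  const void *src, size_t size)
{
    /* Host-visible staging buffer the driver can DMA from directly. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size,
                                   NULL, NULL);
    void *ptr = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, NULL);
    memcpy(ptr, src, size);

    /* Non-blocking copy into the device buffer; with pinned staging
     * memory this transfer can overlap work on other queues. */
    clEnqueueWriteBuffer(q, devbuf, CL_FALSE, 0, size, ptr,
                         0, NULL, NULL);
    clEnqueueUnmapMemObject(q, pinned, ptr, 0, NULL, NULL);
    clReleaseMemObject(pinned);  /* deferred until commands complete */
}
```

    Without staging like this, the runtime may fall back to a synchronous path for the host pointer, which could mask the cross-device overlap you are looking for.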

    Thanks for the reply.

    Yep, as stated, it is manufactured (a kolourpaint4 hack job…). I made that clear.

    My CPU has 40 PCIe lanes, and I’m using an Asus X99-A, which supports up to 3-way SLI. I don’t think this comes into play with X99, at least for two GPUs, as only one PCIe 3.0 x16 slot (the third, currently occupied by something that isn’t a GPU) is specified to be behind a switch. I could understand this being a concern for Z97 boards, for example, whose CPUs have only 16 lanes…

    Thanks, I will report it as a bug.