clGetEventInfo does not always seem to work for CL_EVENT_COMMAND_EXECUTION_STATUS. For the first few kernel executions it behaves as expected and promptly resolves to CL_COMPLETE. After that, however, it seems to get stuck returning CL_QUEUED for a very long time. Eventually, after perhaps a minute, it gets tired of that and sends another CL_COMPLETE, but then promptly gets stuck on CL_QUEUED again. It seems to get stuck after precisely the 5th call - since I’m using it to load balance multiple devices, that would be the 3rd call for that specific command queue. Both devices also seem to get “unstuck” at exactly the same time. Could this be some sort of timeout?
I do not know the answer to your question, but I can offer my load balancing technique, which works very well. I sort of take it completely out of OpenCL’s hands & put it in the host’s. I build an array of contexts, each with a single command queue. The # of contexts == # of GPUs.
I create a queue on the host, to submit work to. A host thread pool is also created, where the # of threads == # of contexts, so each pool thread is assigned a context. These pool threads grab work from the host queue at their own pace and execute it on their context. When the queue is complete, I grab global memory from each context, and come up with a final result in the host language.
Java has a concurrency package as one of its standard libraries. That does most of the work for me. This would not sound nearly as attractive if you had to implement everything yourself. What comes with MFC, if you are on Windows, I do not know. OpenCL command queues do not report how much work is waiting, so, as you know, you need to do a lot of event bookkeeping to figure out which command queue gets the next unit of work. There is no decision making with the host queue / thread pool method. It just happens. The GPUs do not need to be the same size. Each just works as fast as it can.
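A minimal sketch of the host queue / thread pool idea, using only java.util.concurrent. `DeviceContext` and `HostQueueBalancer` are hypothetical names, not part of any OpenCL binding: `DeviceContext.run` stands in for "enqueue the kernels for one unit of work on this context's command queue and wait for them", which a real program would do through whatever binding it uses.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical stand-in for one OpenCL context plus its single command queue.
interface DeviceContext {
    void run(int workItem) throws InterruptedException;
}

class HostQueueBalancer {
    // Drains workItems units of work and returns how many each device processed.
    static int[] drain(List<DeviceContext> devices, int workItems) {
        BlockingQueue<Integer> hostQueue = new LinkedBlockingQueue<>();
        for (int i = 0; i < workItems; i++) hostQueue.add(i);

        AtomicIntegerArray done = new AtomicIntegerArray(devices.size());
        // One pool thread per device, so each thread owns exactly one context.
        ExecutorService pool = Executors.newFixedThreadPool(devices.size());
        for (int d = 0; d < devices.size(); d++) {
            final int id = d;
            pool.execute(() -> {
                Integer item;
                // poll() returns null once the host queue is empty, ending the worker.
                while ((item = hostQueue.poll()) != null) {
                    try {
                        devices.get(id).run(item);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    done.incrementAndGet(id);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        int[] counts = new int[devices.size()];
        for (int d = 0; d < counts.length; d++) counts[d] = done.get(d);
        return counts;
    }
}
```

Because each worker only pulls the next item when its device is free, a faster device simply ends up with a larger count; nothing in the code decides who gets what.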
Why would you have # of contexts == # GPU’s instead of a single context within which # of queues == # GPU’s? Queues are tied to devices, not contexts.
If you have a single context with multiple devices, then OpenCL shuffles data between them in the background in an attempt to keep them coherent. Truly horrible design decision, since a PCIe transfer is very nearly as slow relative to device memory as the hard drive is relative to system memory. Believe me, I discovered this the hard way >.<
My primary reason was simplicity. Kernel arguments cannot be set on a per-command-queue basis. So even if you have a single context, you need a kernel for each queue. Just making the contexts completely separate is more elegant, which helps when you want to go back and look at your code 6 months later. You also may need separate buffers, unless they are just for one-time input.
There is also one thing you can do when you have separate contexts: use GPUs from different platforms. This played very little part in my decision though.
As for what Keldor said about coherence, I never experienced it, but then I never even tried that method.
Does it really work like this? Even when you use different buffers/images on different devices? I mean I don’t see any reason for any synchronization when a single buffer is used only on a single device and it is never accessed on others. Actually even all provided examples from the OpenCL API are implemented this way (one context with N devices and N buffers, 1 buffer for 1 device).
The copy is only made within the context if a buffer really is shared. That is, if a kernel on one device attempts to access a buffer that’s currently on another device. If each kernel works on its own buffers, no copies should be made. The oclSimpleMultiGPU example uses a single context across all GPUs, with individual command queues and buffers for the devices.
Yes, that’s how I expect it to work (plus I suppose there should be no synchronization in the case where all devices only read from some memory… except, of course, for the initial copy of the memory to all devices). It’s just that Keldor sounded like the synchronization is applied even when it is not necessary, so I’m not quite sure whether it really works as I expect. Unfortunately I don’t have multiple GPUs right now to try it.
Sorry if I wrecked your thread. Back to load balancing. Some sort of workaround could take advantage of the fact that command queues are in fact queues. If you are pushing work to them & that work is fairly uniform in run time, allocating it in proportion to the relative # of compute units of each device seems plausible. Being queues, they do not have to be ready when work is submitted to them.
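A rough sketch of that proportional split, assuming the compute unit counts have already been read from each device (in OpenCL that would be CL_DEVICE_MAX_COMPUTE_UNITS via clGetDeviceInfo). `ProportionalSplit` is a hypothetical helper, and it only makes sense under the stated assumption that the work units have fairly uniform run time.

```java
class ProportionalSplit {
    // Splits totalItems across devices in proportion to their compute unit counts.
    // Items left over after rounding down go to the devices with the largest remainders.
    static int[] split(int totalItems, int[] computeUnits) {
        long totalUnits = 0;
        for (int cu : computeUnits) totalUnits += cu;
        if (totalUnits <= 0) throw new IllegalArgumentException("need at least one compute unit");

        int n = computeUnits.length;
        int[] share = new int[n];
        long[] remainder = new long[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            long scaled = (long) totalItems * computeUnits[i];
            share[i] = (int) (scaled / totalUnits);   // rounded-down share
            remainder[i] = scaled % totalUnits;       // what the rounding dropped
            assigned += share[i];
        }
        // Fewer than n items can be left over, so each device gets at most one extra.
        for (int left = totalItems - assigned; left > 0; left--) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (remainder[i] > remainder[best]) best = i;
            }
            share[best]++;
            remainder[best] = -1; // this device already received its extra item
        }
        return share;
    }
}
```

For example, `ProportionalSplit.split(100, new int[] {30, 10})` gives the 30-unit device 75 items and the 10-unit device 25. Compute units are only a crude proxy for speed across different GPU generations, so measured throughput may be a better weight in practice.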
With the CPU queue / device pull method, I am currently using a finish() at the end of a kernel set, but am about to change. finish() ties up a core just waiting, although the CPU might swap it out at some point. This limits other work the CPU could be doing. I already calibrate workgroup sizes of individual kernels to GPUs at installation time. If I do the same with kernel sets, I could put in a thread sleep for, let’s say, 85% of the average set time for that GPU, and then call finish().
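A small sketch of the sleep-then-finish idea. `SleepThenFinish` is a hypothetical helper; the `flush` and `finish` arguments stand in for the clFlush and clFinish calls of whatever binding is in use. The flush before the sleep is an assumption on my part: without it the queued work might not be submitted to the device until finish() is finally called, and the sleep would just be dead time.

```java
class SleepThenFinish {
    // How long to sleep: a fraction (e.g. 0.85) of the calibrated average set time.
    static long sleepMillis(long averageSetMillis, double fraction) {
        return Math.round(averageSetMillis * fraction);
    }

    // flush and finish are hypothetical hooks for the binding's clFlush / clFinish.
    static void waitForSet(long averageSetMillis, double fraction,
                           Runnable flush, Runnable finish) {
        flush.run(); // submit the queued commands before giving up the core
        try {
            Thread.sleep(sleepMillis(averageSetMillis, fraction));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // still fall through to finish
        }
        finish.run(); // blocks only for the remaining part of the set
    }
}
```

If a set runs shorter than the calibrated average, the thread oversleeps and the device idles, so the fraction trades host CPU time against device utilization.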
This thread got me thinking more actively about how I was going to modify my finish. The spec does not strictly say that you have to call clFlush to check an event’s status, only when waiting on an event. You are probably already calling clFlush after every kernel set you queue, but also calling it just before you check the status might knock out the cobwebs.
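Roughly what I mean, as a sketch. `FlushThenPoll` is a hypothetical helper; `flush` stands in for clFlush on the command queue and `status` for a clGetEventInfo query of CL_EVENT_COMMAND_EXECUTION_STATUS through whatever binding you use. The status values follow cl.h: CL_COMPLETE is 0, CL_QUEUED is 3, and negative values signal an error.

```java
import java.util.function.IntSupplier;

class FlushThenPoll {
    static final int CL_COMPLETE = 0; // same value as in OpenCL's cl.h

    // Polls without blocking the device: flush first, then read the event status.
    // Returns true once the event reports CL_COMPLETE, false if maxPolls runs out.
    static boolean pollUntilComplete(Runnable flush, IntSupplier status,
                                     int maxPolls, long pauseMillis) {
        for (int i = 0; i < maxPolls; i++) {
            flush.run();                      // make sure queued commands are submitted
            int s = status.getAsInt();
            if (s < 0) throw new IllegalStateException("command failed with status " + s);
            if (s == CL_COMPLETE) return true;
            try {
                Thread.sleep(pauseMillis);    // let the host do other things between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```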
Yep, adding a clFlush for each event query fixed it. But surely this is still a bug? Shouldn’t the command queue get submitted regardless of whether I actually manually flush it or not?
I bet you’re on Vista or Win7.
It is a bug if you are already putting in a flush as part of the queue submission (you are not clear on whether you are also doing this). The Khronos link earlier says finish(), waitForEvents(), etc. all perform an implicit flush. If each of the blocking calls has to do it, that seems to imply you have to do it explicitly yourself when you do not want to block.
Having to also do it in a loop, just before an event status check, a.k.a. the jiggle-the-handle move, sounds like a bug to me. I also have an idea of how they are going to fix it.
Yep, this is running on Vista. I take it this is another WDDM “feature”?
Correct. Launching a kernel may not actually trigger the hardware to start running it until certain conditions are met.
I’ve noticed another issue, where memory copies may not return if blocking is set to true. Seems to happen rather at random, and putting in a flush before the copy only sometimes works.
Actually, it looks like the flushes have not in fact completely solved the problem. While the events mostly update properly, they still get stuck occasionally.