Controlling device from host

I’ve been trying to figure out how to control the device while it’s executing. I realize that this isn’t really what the GPU is all about, but nevertheless…
The basic idea is to allocate space for a variable in the device’s global memory and have a pointer to that variable available to both host and device. When the kernel is executed, it loops on this variable; that is, the device loops until the host sets the variable to a predefined “exit value” (I know, it’s terrible).
In my code I used cudaMallocHost to allocate memory for the variable, asynchronously launched the kernel in one stream, and then asynchronously changed the value of the variable using another stream.
This resulted in an infinite loop because, from what I understand, the variable is read correctly by the device only on the first iteration of the loop. On subsequent iterations there is no change in value, even though the host asynchronously changes it from another stream.
Any ideas what I’m doing wrong? Is there a specific way to allocate memory for this “shared” variable so that changes made by the host are reflected on the device while it executes?
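
Here’s a condensed sketch of what I’m doing (not my exact code; I’ve written it with cudaHostAlloc and cudaHostAllocMapped, which is what gives the device a usable pointer into the host allocation, and the exit value is just a placeholder):

    // Condensed sketch (not the exact code); EXIT_VALUE is a placeholder.
    #include <cuda_runtime.h>

    #define EXIT_VALUE 1

    __global__ void spinOnFlag(int *flag)
    {
        // As described above, the device appears to read *flag only once,
        // so it never observes the host's later write.
        while (*flag != EXIT_VALUE)
            ;
    }

    int main(void)
    {
        int *hFlag, *dFlag;
        cudaSetDeviceFlags(cudaDeviceMapHost);               // enable mapped memory
        cudaHostAlloc((void **)&hFlag, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&dFlag, hFlag, 0); // device view of hFlag
        *hFlag = 0;

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        spinOnFlag<<<1, 1, 0, stream>>>(dFlag);  // launches asynchronously
        *hFlag = EXIT_VALUE;                     // host sets the exit value...
        cudaDeviceSynchronize();                 // ...but this call never returns
        cudaStreamDestroy(stream);
        return 0;
    }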

P.S.
If my question is still unclear just let me know, and I’ll post my implementation.
-Andrey

Your self-documented admission of horror from such a terrible hack is well noted.

What you are trying to do IS possible, but it’s also really sticky, ugly, and confusing; even if it does work, it will likely give you headaches.

But I admit I have tried similar hacks. Even in the simplest form, sometimes it’s useful for the host to send a simple command like “STOP! I don’t need you anymore!” to a running kernel, and device memory is about the only way.

The programming docs (properly) don’t give you many promises about support for this. But you definitely want to mark your “control” memory as volatile so the compiler doesn’t optimize your repeated checks away. With that change, my simple commands-via-globals worked OK, but I’m still nervous about it.
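
Concretely, the difference is just the qualifier on the pointer (a minimal sketch; the exit value 1 is arbitrary):

    // Without volatile, the compiler is free to hoist the load out of the
    // loop and spin on a stale register copy of the flag:
    __global__ void broken(int *flag)         { while (*flag != 1) { } }

    // With volatile, every iteration issues a fresh read from memory:
    __global__ void fixed(volatile int *flag) { while (*flag != 1) { } }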

Perhaps a more important question… take a step back. WHY do you need to perform active control like this? One reasonable answer is because of kernel launch overhead… but have you measured this to see if it’s really a problem?

Early abort is also a good example: you don’t want to keep burning compute on a stale job once it’s been invalidated.

Well, the thing is, I’m planning on sending more commands than just STOP. For example, let’s say there’s another shared variable for data, and the host sets the data. So there are already two “registers”: state and data. Every time the host loads a value into the data register, the host then sets the state register to, let’s call it, “PROCESS_DATA”, and then to something like “PAUSE” so that the same value is not processed multiple times. This goes on until the host finally issues a “STOP” command through the state register. The point of all this is pretty much for learning purposes only: I prefer learning through examples, and even though parallel programming isn’t quite about what I’ve been describing, I want to see how the GPU’s performance compares to a similar CPU implementation.
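
As a rough sketch, the device side of that protocol would look something like this (the command names and the doubling “work” are placeholders; volatile flags in mapped pinned memory are assumed, as above):

    // Rough sketch of the state/data "register" protocol described above.
    // Command values are hypothetical.
    enum { CMD_PAUSE = 0, CMD_PROCESS_DATA = 1, CMD_STOP = 2 };

    __global__ void commandServer(volatile int *state,
                                  volatile int *data,
                                  volatile int *result)
    {
        for (;;) {
            int cmd = *state;
            if (cmd == CMD_STOP)
                break;                   // host is done with us
            if (cmd == CMD_PROCESS_DATA) {
                *result = *data * 2;     // placeholder for real work
                *state  = CMD_PAUSE;     // acknowledge, so the same data
            }                            // isn't processed twice
        }
    }

One tweak versus the description above: in this sketch the device writes PAUSE back itself after consuming the data, since the host can’t otherwise tell when it’s safe to do so.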

Wow, that’s ugly.

Are you not doing multiple kernel launches because you want to run as many iterations as possible in a given timeslice without the overhead of multiple kernel calls?

PS: spinlocking the GPU (which you’d have to do, I think, to implement a “pause” flag) is a bad idea if you care about power consumption at all.

That’s pretty much it: launching a single kernel and supplying it with data through some kind of synchronization mechanism.

By the way, the problem I was having was that I forgot to declare the “register” variable as volatile… Thanks to SPWorley for pointing it out.

Oh, and since we’re on the topic of synchronization: is __syncthreads() the only synchronization mechanism the CUDA runtime API supports? I didn’t see anything in the manual about semaphores or mutexes.

__syncthreads() is all there is. (A mutex or semaphore doesn’t work, because of what I outline below.)
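
For reference, it’s a barrier across the threads of a single block only; a typical staging pattern looks like this (a sketch, assuming a 256-thread block):

    // __syncthreads() synchronizes the threads of ONE block, nothing more.
    // Typical use: stage data through shared memory, then barrier before
    // reading what other threads in the block wrote.
    __global__ void reverseWithinBlock(float *out, const float *in)
    {
        __shared__ float tile[256];              // assumes blockDim.x == 256
        int i = threadIdx.x;
        int g = blockIdx.x * blockDim.x + i;

        tile[i] = in[g];
        __syncthreads();                         // all writes to tile[] visible

        out[g] = tile[blockDim.x - 1 - i];       // read another thread's element
    }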

Also, I realized your method won’t work (well, won’t work past ~30k threads on a GT200). Once a block is dispatched to an SM, it runs until it completes. The maximum number of threads resident on an SM is 1024 (assuming you aren’t using too much shared memory or too many registers), and a GT200 has 30 SMs, so at most 30 × 1024 = 30,720 threads are resident at once, hence the ~30k figure. If you ever spinlock, blocks beyond that first wave never start running: they can’t be scheduled until you issue the stop command and the resident blocks start completing.

Basically you’re going to deadlock the card, and this won’t work. Sorry for not seeing this before; not sure why I didn’t, since I once wrote a simple deadlock application that did almost exactly this…

Wait, so the multiprocessors don’t switch among blocks, only among warps of active blocks?

The manual (in section 3.2) seems to back this up.

Are they implying that once a block becomes “active”, it is executed by one of the multiprocessors until all warps of that block are finished? As in, there’s switching among warps within an active block, but no switching between blocks (which makes sense, I guess, given the inefficiency of reloading data into hardware registers)?

Also, how do you determine the maximum number of threads that can be executed concurrently (taking warp switching into account)?

Yep, you’ve got it.

Take the maximum number of threads that can run concurrently on a single MP (from deviceQuery or the Appendix) and multiply by the number of MPs. This assumes 100% occupancy, though. You can use the occupancy calculator to find out how many blocks run concurrently on each MP given your code.
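
In code, that calculation looks something like this (a sketch; cudaDeviceProp::maxThreadsPerMultiProcessor is available in newer runtimes, otherwise take the per-MP limit from the Appendix):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch: peak number of concurrently resident threads, assuming
    // 100% occupancy (i.e. shared memory and registers aren't the limit).
    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int peak = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
        // e.g. a GT200: 30 MPs * 1024 threads = 30720, the ~30k figure above
        printf("Up to %d concurrently resident threads\n", peak);
        return 0;
    }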

Gotcha. Thank you for your help, guys!