On-the-fly polling of buffers.

For my algorithm, I need to poll a buffer constantly on the CPU while a kernel is running and possibly modifying that buffer. Essentially, I’m trying to implement some sort of paging support: the kernel adds requests to a request buffer, and the CPU polls the buffer and processes the requests while the kernel is running.

Is this possible using zero copy memory?

Thank You,
Debdatta Basu.

Yes. But it’s not guaranteed! And it’s painfully temperamental and since it’s not a guaranteed behavior it could change at any time.

But I have done it myself… I have kernels that run for hours sometimes and I wanted to be able to print status messages back to the CPU, and even have the CPU pass new data into the running GPU kernel. I just used zero copy. You have to keep polling on both GPU and CPU, and on the GPU you have to be super-careful because you don’t want a cached memory read… so you need to make sure the pointer you poll through is declared volatile.
But again, this is all playing with fire. The right way to do CPU/GPU intercommunication and synchronization is with kernel launches and streams. Zero copy polling and synchronization games are dangerous.
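A minimal sketch of the setup described above, not the poster's actual code: one pinned, mapped (zero copy) buffer, a kernel that spins on a volatile flag, and a host that sets the flag after the launch. The names `wait_for_host` and `poll_flag_demo`, the buffer layout, and the `max_spins` bail-out are all assumptions made for illustration, and as the post says, none of this behavior is guaranteed.

```cuda
#include <cuda_runtime.h>

// Spin until the host changes *flag away from 0, then record the value seen.
// volatile forces a real load on every iteration instead of a reused value.
__global__ void wait_for_host(volatile int* flag, int* seen, int max_spins)
{
    int v = 0;
    for (int i = 0; i < max_spins && v == 0; ++i) {
        v = *flag;
    }
    *seen = v;  // stays 0 if we gave up before the host wrote anything
}

// Returns the value the kernel observed, or -1 on a CUDA error.
int poll_flag_demo()
{
    int* h_buf = nullptr;  // host view: h_buf[0] = flag, h_buf[1] = seen
    int* d_buf = nullptr;  // device view of the same pinned memory

    // Ask for mapped host memory support (needed on older toolkits).
    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void**)&h_buf, 2 * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    if (cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    h_buf[0] = 0;
    h_buf[1] = 0;

    // The launch is asynchronous, so the host continues immediately.
    wait_for_host<<<1, 1>>>(d_buf, d_buf + 1, 1 << 24);

    // Written while the kernel is (most likely) already spinning.
    *(volatile int*)h_buf = 7;

    cudaError_t err = cudaDeviceSynchronize();
    int result = (err == cudaSuccess) ? h_buf[1] : -1;
    cudaFreeHost(h_buf);
    return result;
}
```

Calling `poll_flag_demo()` from `main` should return 7 when the kernel observes the host's write. The spin bound is only there so that a missed update ends the kernel instead of hanging the GPU.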

The next post in this thread after this one will be tmurray saying “Don’t do that.”

It’s encouraging to hear someone has done this! I wonder why this is not a fully supported feature? It would give much better GPU utilization for kernels that need unpredictable access patterns on out-of-core memory (say, hard disk or network)…

By the way, what temperamental behavior did your implementation show?

And thanks for the heads up about the volatile part… I wouldn’t have remembered that, and it would have led to a lot of frustration. :)
But why does this only apply to the GPU? Shouldn’t the CPU worry about cached accesses too?

Cheers!
Debdatta Basu.

It’s not guaranteed that memory writes on the CPU will be seen by a running GPU kernel since there’s no synchronization primitive that enforces the ordering except for a kernel launch and completion… which is what you’re deliberately avoiding.

There is a GPU -> CPU synchronization, __threadfence_system(). (See section B.5 of the programming guide). It’s only for Fermi.
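A rough sketch of how `__threadfence_system()` could order a GPU-to-CPU message: the kernel writes a payload, fences, then raises a flag that the host polls. The mailbox layout, the names `post_status` and `status_demo`, and the 5 second timeout are assumptions for illustration. The kernel here is deliberately tiny; in the long-running case it would keep working after posting. The `cudaStreamQuery(0)` call is a commonly used nudge to get queued work submitted on drivers that batch launches, and the host side relies on volatile reads being kept in program order, which holds on x86.

```cuda
#include <cuda_runtime.h>
#include <chrono>

// mailbox[0] = payload, mailbox[1] = "ready" flag
__global__ void post_status(volatile int* mailbox, int value)
{
    mailbox[0] = value;      // 1. write the payload
    __threadfence_system();  // 2. order it before the flag, as seen by the host
    mailbox[1] = 1;          // 3. announce that the payload is there
}

// Returns the payload read by the host, -1 on a CUDA error, -2 on timeout.
int status_demo(int value)
{
    int* h_box = nullptr;
    int* d_box = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void**)&h_box, 2 * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    if (cudaHostGetDevicePointer((void**)&d_box, h_box, 0) != cudaSuccess) {
        cudaFreeHost(h_box);
        return -1;
    }
    h_box[0] = 0;
    h_box[1] = 0;

    post_status<<<1, 1>>>(d_box, value);
    cudaStreamQuery(0);  // encourage the driver to submit the launch now

    // Host-side polling loop: flag first, payload second.
    volatile int* box = h_box;
    int result = -2;
    auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
    while (std::chrono::steady_clock::now() < deadline) {
        if (box[1] == 1) {
            result = box[0];
            break;
        }
    }

    cudaDeviceSynchronize();
    cudaFreeHost(h_box);
    return result;
}
```

With a working device, `status_demo(123)` should return 123. Without the fence the two stores might become visible to the host in either order.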

My running-kernel tool works but it’s got way more latency than I expected. And atomic updates work but fail sometimes (1 in 50000 times) and you can get double atomic increments or skipped values… Again this is NOT an error, it’s not even suggested that atomic accesses should work at all.

I should write up my library… it’s actually quite useful but a little too complex for a forum post.

Q1) There is some confusion here… One way this could fail is if the GPU holds a cached copy of a zero copy memory location in global memory. If you then change it on the CPU, that cached copy may not be invalidated while the kernel is running. The volatile keyword guarantees fresh reads by generating global memory accesses for operands that would otherwise be cached, but it doesn’t say anything about accesses to zero copy memory that may itself be cached in global memory… Does the volatile keyword work in this case as well?

Q2) Is there a similar sync primitive on the CPU side as well? I’m interested only in the ordering, as I will constantly poll from the GPU. I mean, if I fill out a buffer from the CPU and then set a flag which is polled by the GPU, I would want the buffer contents to be visible to the GPU before the flag is…

Zero copy atomics work for you most of the time? That’s nice…

And I would love to see your library out sometime… It would be nice to see the limits of what can be achieved with CUDA.

Cheers!

Debdatta Basu

Bump!
Sorry for that, but it would be great if someone answered this… To be specific, I need answers to Q1 and Q2 in the previous post… I’ve written basic paging code and it works… it hasn’t crashed on me or given errors yet… But I don’t really have a simulation farm to test millions of cases… ;)

Debdatta Basu

Are you talking about a user CUDA kernel actively caching zero copy memory in global memory or are you suggesting that the GPU might itself cache zero copy memory in global memory? As I understand it the latter simply doesn’t happen (at least not on any hardware to date).

I was suggesting that the GPU itself might cache zero copy memory locations in global memory… If this does not happen, then running kernels should be able to see CPU-side updates to zero copy memory, right?

Can somebody from the driver team confirm this?

Thank You,

Debdatta Basu

The documentation says nothing about CUDA’s volatile keyword and zero-copy, so nothing is guaranteed.

But in practice, as long as you’re doing a volatile read, zero copy seems to read the updated values as you’d expect.

(If you look at the PTX level, you can get a guarantee of this by using ld.cv loads which are designed to see any CPU updated words. This is not a synchronization, it’s just a promise that the caches won’t get in your way and obscure changes, therefore allowing polling. See table 81 in the PTX manual in the CUDA toolkit.)
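For reference, a small sketch of what an `ld.cv` load might look like from CUDA C via inline PTX. The helper name `load_cv`, the test kernel, and the demo function are made up for illustration; the `"l"` constraint assumes 64-bit pointers, and `ld.cv` needs a Fermi-class (sm_20) or newer target.

```cuda
#include <cuda_runtime.h>

// Load one int with the .cv cache operator ("cache as volatile"): cached
// lines for this address are treated as stale and the value is fetched again.
__device__ __forceinline__ int load_cv(const int* p)
{
    int v;
    asm volatile("ld.cv.s32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
    return v;
}

// Trivial kernel: read src through load_cv and store the result in dst.
__global__ void read_once(const int* src, int* dst)
{
    *dst = load_cv(src);
}

// Writes `value` into zero copy memory on the host, reads it back on the GPU.
// Returns what the kernel read, or -1 on a CUDA error.
int load_cv_demo(int value)
{
    int* h_buf = nullptr;  // h_buf[0] = input, h_buf[1] = output
    int* d_buf = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void**)&h_buf, 2 * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    if (cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    h_buf[0] = value;
    h_buf[1] = 0;

    read_once<<<1, 1>>>(d_buf, d_buf + 1);

    cudaError_t err = cudaDeviceSynchronize();
    int result = (err == cudaSuccess) ? h_buf[1] : -1;
    cudaFreeHost(h_buf);
    return result;
}
```

In a polling loop you would call `load_cv(flag)` in place of a plain or volatile dereference. Comparing the generated PTX for both variants is an easy way to check what the compiler actually emits.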

Yes. It’s cudaThreadSynchronize().

That’s not the answer you want to hear, but it’s the only synchronization primitive on the CPU side.

But in practice, you can at least use GPU polling… Something like:

CPU:

  1. First write lots of interesting data into zero-copy.

  2. Write a special “I’m done” token into zero-copy. This may be an incrementing counter, so that each token is unique.

GPU:

  1. Poll for the “I’m done” token using ld.cv reads (in CUDA, probably just volatile is enough, check the PTX.)

  2. After getting the token, read the interesting data from the CPU.

Polling on the GPU sucks, though. It’s just not designed for it and it’s easy to waste huge fractions of your throughput on spins.

The only hint I have is to use __threadfence_system() as a no-op… it’s the “slowest” operation you can perform in CUDA (longest latency), which is bizarrely what you want, since you want your polling thread to get as little scheduled time as possible.

Again I stress that any time you try polling/spinning locks or synchronization on the GPU, you’re asking for trouble. I just don’t recommend it.

You have to do some sort of store fence on the CPU side in order for that to work reliably.
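A hedged sketch that puts the CPU and GPU steps listed earlier together with a CPU-side store fence. Everything specific here is an assumption: the buffer layout, the names `consume` and `handshake_demo`, the token value 1, the spin bound, and the use of `_mm_sfence()` from `<xmmintrin.h>` (x86 hosts only) as the fence. The thread does not say what else, if anything, is needed beyond SFENCE, so treat this as illustrative rather than reliable.

```cuda
#include <cuda_runtime.h>
#include <xmmintrin.h>  // _mm_sfence, x86 hosts only

#define N_WORDS 256

// box[0] = token, box[1..N_WORDS] = payload, box[N_WORDS + 1] = result
__global__ void consume(volatile int* box, int expected_token, int max_spins)
{
    // GPU step 1: poll for the token.
    int token = box[0];
    for (int spins = 0; token != expected_token && spins < max_spins; ++spins) {
        __threadfence_system();  // used purely as a slow op to throttle the spin
        token = box[0];
    }
    if (token != expected_token) {
        box[N_WORDS + 1] = -1;   // gave up
        return;
    }
    // GPU step 2: read the data the CPU wrote before the token.
    int sum = 0;
    for (int i = 1; i <= N_WORDS; ++i) {
        sum += box[i];
    }
    box[N_WORDS + 1] = sum;
}

// Returns the sum computed by the kernel, or -1 on error or give-up.
int handshake_demo()
{
    int* h_box = nullptr;
    int* d_box = nullptr;
    const size_t bytes = (N_WORDS + 2) * sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void**)&h_box, bytes, cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    if (cudaHostGetDevicePointer((void**)&d_box, h_box, 0) != cudaSuccess) {
        cudaFreeHost(h_box);
        return -1;
    }

    volatile int* box = h_box;  // volatile keeps the compiler from reordering
    for (int i = 0; i < N_WORDS + 2; ++i) {
        box[i] = 0;
    }

    consume<<<1, 1>>>(d_box, 1, 1 << 22);

    // CPU step 1: write the interesting data.
    for (int i = 1; i <= N_WORDS; ++i) {
        box[i] = i;
    }
    _mm_sfence();  // store fence: data stores are ordered before the token store
    // CPU step 2: write the "I'm done" token.
    box[0] = 1;

    cudaError_t err = cudaDeviceSynchronize();
    int result = (err == cudaSuccess) ? h_box[N_WORDS + 1] : -1;
    cudaFreeHost(h_box);
    return result;
}
```

If the handshake works, `handshake_demo()` returns 32896, the sum of 1 through 256. Any other positive value would mean the kernel saw the token before some of the payload.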

Oh and this is an area where I’ve never seen a practical application (I’ve written some proof of concept code that does it, but I’ve never seen anyone figure out what it’s useful for).

@Worley

That was exactly what I was looking for… Thanks!

The __threadfence_system() tip was great! :)

In my case it won’t really be needed though… I’m running a pipelined kernel using persistent threads, so I can just switch to another portion of the kernel if the data is not available… In practice I think this will work well, as the tasks that need outside data are few compared to the total number of tasks…

I could try to partition the tasks according to the data they need beforehand, but this will be difficult, and require a sorting pass… It will also kill some of the persistent kernel advantages…

@Murray

Can you give me an example of such a fence function?

Cheers!

I know it involved at least an SFENCE, and I seem to recall there was one other component. However, I don’t remember what it is. I can check tomorrow.

That’s a standard memory fence… Yeah… should work… I thought it would need something special… I wonder what the other component is…

In any case… it can wait until tomorrow ;)

http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1039

A publication on this very topic… Nice… :) I did not find the full text though… maybe because it’s very recent…