Custom CPU to GPU ringbuffer

I have been catching up on all the new stuff in CUDA 4 since I last played with CUDA 3.

I want to know if it is now possible to create a custom CPU to GPU ring buffer to efficiently send commands to a kernel from the host without going through the CUDA API and its associated overheads and latencies. I will be using 64-bit Linux so I can use unified virtual addressing to make programming easier.

The idea I have so far is to allocate some pinned pages with write combining and map them to the GPU using cudaHostRegister(). I can then write commands and data to the ring buffer using the non-caching SSE streaming store instructions.
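A rough host-side sketch of that setup. Note: as far as I can tell, the write-combined flag is only available through cudaHostAlloc(), not cudaHostRegister(), so this sketch allocates with cudaHostAlloc(); the buffer size is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>
#include <emmintrin.h>          // SSE2 streaming (non-caching) stores

#define RING_BYTES 4096         // assumed: one page for the ring buffer

int main()
{
    // Pinned, write-combined, GPU-mapped host memory in one call.
    void *ring;
    cudaHostAlloc(&ring, RING_BYTES,
                  cudaHostAllocWriteCombined | cudaHostAllocMapped);

    // Device-side pointer to the same physical pages, for the kernel.
    void *ringDev;
    cudaHostGetDevicePointer(&ringDev, ring, 0);

    // Write a 16-byte command chunk with a non-temporal store that
    // bypasses the CPU caches (appropriate for write-combined memory).
    __m128i chunk = _mm_set1_epi32(0xC0FFEE);
    _mm_stream_si128((__m128i *)ring, chunk);
    _mm_sfence();               // make the streaming store globally visible

    cudaFreeHost(ring);
    return 0;
}
```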

The GPU side will be more difficult as careful management is needed to prevent CUDA from caching system RAM on the GPU. The CUDA synchronize and event commands can do this but I want to avoid using the CUDA API.

From other comments it appears possible to compile code with flags to disable both L1 and L2 caching with -Xptxas -dlcm=cs. However, to keep performance high I think it will be better to keep caching on and just use inline PTX to generate load instructions that bypass the cache.

Thus I propose using the cv cache operator, which seems to be designed for accessing CPU memory as it always reloads the data from system memory.

The kernel will run an infinite loop (until an exit command is received) that checks a particular flag byte. When this flag is set, the kernel reads an instruction or data chunk (with the size specified in a particular location) from the current read pointer and processes it. Once it has finished processing it resets the flag.

The CPU simply writes to the buffer from the write pointer, updates the command size, and sets the flag. I am not sure what data size is atomic for both the CPU and GPU so I will likely use a byte to store the flag. The CPU can then wait until the flag is cleared and read back any results. I may come up with an even better design that allows continuous streaming of data without needing to check for flags.
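A minimal sketch of this flag protocol on the GPU side (the header layout and names are my assumptions; hdr and data would be device pointers obtained from cudaHostGetDevicePointer(), and this runs as a single thread):

```cuda
// Assumed command header at the start of the pinned ring buffer.
struct RingHeader {
    unsigned char flag;   // set by the CPU when a command is ready
    unsigned char exit;   // set by the CPU to stop the kernel
    unsigned int  size;   // size of the current command in bytes
};

__global__ void ringKernel(volatile RingHeader *hdr, volatile char *data)
{
    while (true) {
        // Poll the flag byte; the volatile qualifier forces each read
        // to go back to the mapped host memory.
        while (!hdr->flag) {
            if (hdr->exit)
                return;                 // exit command received
        }

        unsigned int size = hdr->size;  // command size written by the CPU
        for (unsigned int i = 0; i < size; ++i) {
            // ... process data[i] from the current read pointer ...
        }

        __threadfence_system();   // flush any results back to system RAM
        hdr->flag = 0;            // signal the CPU that we are done
    }
}
```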

Using a ring buffer will allow all sorts of interesting trickery. For example, real-time processing of streaming audio/signal data by having the CPU stream data into one buffer while simultaneously reading back results from another. Another possibility is to find a way to mix this with OpenGL and create a framebuffer system that lets a program draw directly to the screen via CPU memory writes (like old DOS programs could).

If you launch the kernel with enough threads to fit any expected input size, you could emulate kernel launches of different sizes by passing the kernel function parameters and thread count via the ring buffer. The kernel would simply call the specified function with the parameters and use if statements to limit the active thread count. Is it possible to use function pointers in CUDA to control the function called from the CPU, or would a function lookup table be needed?
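On the function-pointer question: device-side function pointers are supported on Fermi (sm_20), but the address of a __device__ function is only meaningful in device code, so the CPU cannot pass one directly; instead the CPU can write an opcode into the ring buffer and the kernel can index a device-side table. A sketch (the opcodes and signatures are made up):

```cuda
typedef void (*CommandFn)(const char *args);

__device__ void cmdCopy(const char *args)  { /* ... */ }
__device__ void cmdScale(const char *args) { /* ... */ }

// Lookup table indexed by the opcode the CPU writes into the ring buffer.
__device__ CommandFn commandTable[] = { cmdCopy, cmdScale };

__global__ void dispatch(const volatile int *opcode, const char *args,
                         int threadCount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < threadCount)            // emulate a smaller launch size
        commandTable[*opcode](args);  // indirect call, requires sm_20+
}
```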

Anyway I can’t try this out at the moment but I will try it next week. I imagine most of this, strictly speaking, may be unsupported or unspecified implementation-specific behaviour, but I am not worried about that as it is all for personal research.

Has anyone done something like this before? Also do the NVIDIA people have anything to say?

I hope the new Kepler chips will support an IOMMU and native sharing of CPU and GPU virtual memory pages. AMD have announced the 7000 series chips will support this. This would allow the GPU to directly access the CPU and process memory allowing truly heterogeneous computing and even more advanced techniques.

I haven’t seen anyone try this kind of live synchronization between the CPU and GPU before, so good luck. :)

You will probably want to also take a look at the memory fence functions in CUDA 4, as those will let you flush writes from a thread up the cache hierarchy to the appropriate level of visibility. __threadfence_system() might be of interest to you.

One idea: allocate a large region of memory on the CPU and have the GPU check it from time to time via zero-copy, using a new area each time. Define the cached granularity (probably the page size) and put each new command on a new page, so the next time the GPU reads it the data will not be cached.

I had a similar idea in mind for some time but never tried it: fill a command queue on the CPU side while a kernel is running and polling host memory (possible extension: the kernel writes results back to the host, which are in turn processed by the CPU to generate new commands).

I fear the polling on the GPU side might be a performance killer though.

On the CPU side, you might need _mm_sfence() to ensure the GPU does not see partial writes of the command data:

volatile int hostCommandsReadyForGpu;

struct commandData { /* parameters */ };

commandData hostCommandBuffer[MAX_COUNT];   // in pinned, mapped host memory

void addCommand(const commandData &cmd)
{
    hostCommandBuffer[hostCommandsReadyForGpu] = cmd;  // write the command data first
    _mm_sfence();                  // fence so the data is visible before the counter update
    hostCommandsReadyForGpu++;     // publish the command to the GPU
}

On the GPU side, I would poll the hostCommandsReadyForGpu counter via mapped memory and compare it to an atomic counter:

__device__ int gpuCommandCounter;

__global__ void processCommands(volatile int *mappedHostCommandsReadyForGpu)
{
    int commandIdToProcess = atomicAdd(&gpuCommandCounter, 1);
    while (true)
    {
        if (commandIdToProcess >= MAX_COUNT)
            break;   // all commands consumed

        // mappedHostCommandsReadyForGpu needs to be read uncached each iteration (volatile)
        if (commandIdToProcess < *mappedHostCommandsReadyForGpu)
        {
            // do something with the mapped command data
            commandIdToProcess = atomicAdd(&gpuCommandCounter, 1);
        }
    }
}



Edit: You would need the __threadfence_system() seibert mentioned when writing back results from the kernel to the host, to ensure all data is visible:

result[i].data = ...;          // write result[i] to mapped host memory
__threadfence_system();        // flush the data before raising the flag
result[i].readyFlag = 1;

On the CPU, you would poll result[i].readyFlag. Consistency of result[i] is guaranteed then.

My first test is simply figuring out how to synchronise output data between the CPU and GPU.

At first I simply called a kernel that copied some data to an output array and then made the CPU wait in an infinite loop until the value in the array changed. This failed as the CPU would just sit there in the loop as the array data did not change.

After some fiddling around I discovered that adding a printf statement would suddenly allow the data in the array to change. I then realised that the CUDA driver/runtime must defer the actual launch of the kernel to sometime after your code actually launches it. Normally a program would not notice this as a cudaMemcpy or otherwise would synchronise and force the kernel to be launched.

Calling printf must have allowed the kernel to launch by yielding the CPU to the kernel or whatever code actually launches the kernel. Thus I changed my infinite waiting loop to use sched_yield() and now it works.
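The resulting wait loop looks roughly like this (resultFlag is a placeholder name for the flag byte in the mapped buffer):

```cuda
#include <sched.h>

// Spin until the kernel raises the flag, yielding the CPU each
// iteration so the driver's deferred kernel launch can proceed.
void waitForGpu(volatile unsigned char *resultFlag)
{
    while (*resultFlag == 0)
        sched_yield();
}
```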

Nighthawk13 has the general idea. I have code that does this as a proof of concept somewhere, but the interesting thing is that perf is much worse than you’d expect.

Also you’re doing basically the same thing as what the driver does internally so I don’t really know what this gets you.

Thanks for the input! It is always nice to hear from someone who has insider info.

At this stage I am not too worried about perf; I just wanted to do a proof of concept and document it here in case anyone else finds it interesting/useful.

I have done some more tests and have verified that Fermi’s cache needs to be disabled/bypassed for loads and stores to memory that the CPU is accessing. Otherwise Fermi will keep data in the L2 cache and never write it back to system RAM.

The two ptxas options that enable this are:

-dlcm : default load cache modifier

-dlsm : default store cache modifier

The relevant cache modifiers are cv for loads and wt for stores (both documented on pp. 110-111 of the PTX ISA 2.3 manual). With these, all loads and stores go directly to system RAM over the PCIe bus.

Now the only problem is that this disables caching for all memory accesses, including data that could safely be cached because the CPU never touches it, which will hurt performance. So ideally caching should be left on by default and disabled only for the relevant loads and stores.

But I cannot figure out how to do this in my .cu file. It seems that it should be possible to use inline PTX asm for a cv load, but I do not know how to specify a memory address as an asm input. Using the ‘m’ constraint produces an error from the compiler…
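Edit: it looks like the trick is to pass the pointer itself through the "l" (64-bit register) constraint rather than an 'm' memory operand, though I have not verified this yet:

```cuda
// Load a 32-bit word with the .cv cache operator: consider cached
// lines stale and fetch from (system) memory again.
__device__ unsigned int loadCv(const volatile unsigned int *ptr)
{
    unsigned int value;
    asm volatile("ld.global.cv.u32 %0, [%1];"
                 : "=r"(value)
                 : "l"(ptr));
    return value;
}

// Matching store with the .wt cache operator: write through the L2
// cache to (system) memory.
__device__ void storeWt(volatile unsigned int *ptr, unsigned int value)
{
    asm volatile("st.global.wt.u32 [%0], %1;"
                 :
                 : "l"(ptr), "r"(value));
}
```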

As seibert suggested - won’t __threadfence_system() force a flush all the way up the memory hierarchy back to system RAM?

You should be able to call it when you need to without disabling caching entirely…

IIRC all zero-copy memory is uncached on L1 but cached on L2, so you have to use __threadfence_system(). Don’t use the compiler options.

That is interesting. Until now I thought the L2 cache was integrated with the memory controllers, and I wouldn’t think there are memory controllers associated with host memory. So I could imagine zero-copy memory being cached in L1 but not L2, but having it the other way around seems odd to me.

__threadfence_system() without the compiler options is not enough, even if I put __threadfence_system() between every line of code in the kernel. If I put those compiler options back, it starts working again.

I think __threadfence_system() will flush any cached L2 writes back into system RAM but it does not cause future reads to update the L2 cache from system RAM.

What I think is needed is a version of the ‘volatile’ keyword that operates at the host visibility level and not just the thread level (which is what it currently is defined as).

I believe that’s accurate. Basically, it guarantees that zero-copy memory is coherent from the POV of the host CPU or the GPU without any coherence concerns between individual SMs. Effectively it’s equivalent to GT200. (Then again, it’s been a while since I’ve done driver work.)

ljbadenz, mark your CUDA pointers as volatile. I think that’s enough.

edit: found my old code, it uses volatile int* and __threadfence_system() on the GPU side

Aha, you are correct. Someone needs to update section D.2.1.2 of the CUDA programming guide to state that the volatile keyword works at the host/zero-copy visibility level on Fermi. The current wording indicates that volatile only works between GPU threads.

Something else strange I noticed:

If your kernel has a pointer to a struct as a parameter, and you mark this parameter volatile, it will still cache any memory pointed to by members of the struct. You have to mark all the pointers in the struct as volatile as well. I would have thought that if a struct is volatile, all of its members also become volatile. Does anyone know what the C/C++ spec says about this?


struct command
{
  char * in;
  char * out;
};

__global__ void kernel(volatile command * cmds)
{
  // ...
  *cmds->out = *cmds->in;
  // ...
}



does not work properly until you change the struct:

struct command
{
  volatile char * in;
  volatile char * out;
};


I think that’s right too: it’s the difference between a pointer to a volatile memory location and a volatile pointer to a standard memory location. volatile int* x versus int * volatile x.
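Spelled out with all four combinations (a hypothetical illustration, not from the code above):

```cuda
int *p;                    // plain pointer to plain data: everything may be cached
volatile int *q;           // pointer to volatile data: every *q reloads memory
int * volatile r;          // volatile pointer to plain data: r is reloaded, *r may be cached
volatile int * volatile s; // both the pointer and the data are reloaded
```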

OK that makes sense.

But what about if I now have a variable instead of a pointer in the struct:

struct command
{
  volatile char * in;
  volatile char * out;
  unsigned len;
};

__global__ void kernel(volatile command * cmd)
{
  // ...
  if (cmd->len < ...)
  // ...
}



The len variable does not get reloaded from system RAM (it keeps its old value), even though the memory containing the struct is marked as volatile, which should mean that all values in the struct are volatile.

i.e. the compiler should see this:

struct command
{
  volatile char * volatile in;
  volatile char * volatile out;
  volatile unsigned len;
};


Actually this is not a reply, but I am trying to make something similar and I have some questions. I thought I shouldn’t open a new thread…

  1. I would like the GPU to always read the memory mapped to the GPU where the CPU leaves the commands, and to execute a command as soon as it finds one. Does the fact that the memory is mapped guarantee up-to-date contents? Is it possible for the CPU to write a command and the GPU not “see” it? (The CPU leaves commands asynchronously.)

  2. Is it a good choice to run a kernel all the time that reads the ring buffer?

  3. If the answer to the above question is yes, can I call a kernel from a kernel? For example:

    __global__ void read_buffer()
    {
        if (/* found a command */)
            ;   // launch another kernel from here?
    }
thnx in advance

No, you cannot launch a kernel from within a kernel.

I assumed so, it was worth trying though…
How about question 1 and mapped memory behaviour?