I have been catching up on all the new stuff in CUDA 4 since I last played with CUDA 3.
I want to know whether it is now possible to create a custom CPU-to-GPU ring buffer to efficiently send commands to a kernel from the host without going through the CUDA API and its associated overheads and latencies. I will be using 64-bit Linux so I can use unified virtual addressing (UVA) to make programming easier.
The idea I have so far is to allocate some pinned pages with write combining and map them into the GPU's address space using cudaHostRegister(). I can then write commands and data to the ring buffer using non-temporal SSE store instructions, which bypass the CPU caches.
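A minimal sketch of the host-side setup I have in mind is below. One wrinkle: as far as I can tell cudaHostRegister() has no write-combining flag, so the sketch uses cudaHostAlloc() with cudaHostAllocWriteCombined to get pinned, mapped, write-combined pages instead. The buffer size and command value are invented, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <emmintrin.h>   // SSE2 non-temporal stores

#define RING_BYTES (1 << 20)   // 1 MiB ring; size is arbitrary here

int main(void)
{
    // Must be set before the CUDA context is created so host pages
    // can be mapped into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    void *host_ring;
    cudaHostAlloc(&host_ring, RING_BYTES,
                  cudaHostAllocMapped | cudaHostAllocWriteCombined);

    // Device-side pointer to the same physical pages (same value as the
    // host pointer under UVA on 64-bit, but fetched explicitly here).
    void *dev_ring;
    cudaHostGetDevicePointer(&dev_ring, host_ring, 0);

    // Write a 16-byte command with a non-temporal store, then fence so
    // the write-combining buffers are drained before anything is published.
    __m128i cmd = _mm_set1_epi32(0xC0FFEE);   // placeholder command word
    _mm_stream_si128((__m128i *)host_ring, cmd);
    _mm_sfence();

    cudaFreeHost(host_ring);
    return 0;
}
```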
The GPU side will be more difficult, as careful management is needed to prevent the GPU from serving reads of system RAM out of its own caches. The CUDA synchronize and event calls can deal with this, but I want to avoid going through the CUDA API.
From other comments it appears possible to compile with -Xptxas -dlcm=cs to weaken caching of all global loads. However, to keep performance high I think it will be better to leave caching on and just use inline PTX to generate the specific load instructions that bypass the cache.
I therefore propose using the ld.cv load instruction, which seems designed for accessing CPU memory: it always fetches the data again from system memory rather than returning a cached copy.
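Wrapping the cache operator in a small inline-PTX helper might look like this (a sketch for a 32-bit load on a 64-bit address; the function name is my own):

```cuda
// Load a 32-bit word with the .cv ("cache volatile / don't cache") operator,
// forcing the value to be re-fetched from system memory on every call.
__device__ unsigned int load_cv_u32(const unsigned int *p)
{
    unsigned int v;
    asm volatile("ld.global.cv.u32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}
```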
The kernel will run an infinite loop (until an exit command is received) that checks a particular flag byte. When the flag is set, the kernel reads an instruction or data chunk (with the size specified at a known location) from the current read pointer and processes it. Once it has finished processing, it resets the flag.
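The persistent-kernel loop could be sketched roughly as follows. The control-block layout, field names, and the exit command code are all invented for illustration; the flag is a 32-bit word here rather than a byte purely to keep the PTX helper simple.

```cuda
#define CMD_EXIT 0xFFFFFFFFu   // invented exit command code

struct RingCtl {               // lives in the mapped host buffer
    unsigned int flag;         // set by the CPU when a command is ready
    unsigned int size;         // size of the pending command in bytes
    unsigned int cmd;          // command code
};

// Uncached load so polls always see the CPU's latest write.
__device__ unsigned int load_cv_u32(const unsigned int *p)
{
    unsigned int v;
    asm volatile("ld.global.cv.u32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}

__global__ void command_loop(RingCtl *ctl, unsigned char *ring)
{
    for (;;) {
        // Spin until the CPU publishes a command.
        while (load_cv_u32(&ctl->flag) == 0)
            ;                                   // could back off here
        unsigned int cmd = load_cv_u32(&ctl->cmd);
        if (cmd == CMD_EXIT)
            break;
        // ... read ctl->size bytes from the ring at the current read
        //     pointer and process them ...
        __threadfence_system();   // make any results visible to the CPU
        *(volatile unsigned int *)&ctl->flag = 0;   // signal completion
    }
}
```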
The CPU simply writes to the buffer at the write pointer, updates the command size, and sets the flag. I am not sure what access size is atomic on both the CPU and the GPU, so I will likely use a single byte for the flag. The CPU can then wait until the flag is cleared and read back any results. I may come up with an even better design that allows continuous streaming of data without needing to poll flags.
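The host side of one round trip might look like the sketch below. The control-block layout and function name are invented; the important part is the ordering: payload first, then size and command, then the flag, with sfences between the stages so the write-combining buffers drain before each publish.

```cuda
#include <emmintrin.h>
#include <string.h>

struct RingCtl {
    volatile unsigned int flag;   // GPU polls this word
    unsigned int size;
    unsigned int cmd;
};

static void submit(struct RingCtl *ctl, unsigned char *ring,
                   unsigned int cmd, const void *data, unsigned int n)
{
    memcpy(ring, data, n);        // 1. payload into the ring
    _mm_sfence();                 //    drain WC buffers
    ctl->size = n;                // 2. describe the command
    ctl->cmd  = cmd;
    _mm_sfence();
    ctl->flag = 1;                // 3. publish: the GPU may now read it
    _mm_sfence();
    while (ctl->flag != 0)        // 4. wait for the GPU to clear the flag
        _mm_pause();              //    (reads of WC memory are uncached)
}
```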
Using a ring buffer will allow all sorts of interesting trickery. For example, real-time processing of streaming audio/signal data, by having the CPU feed samples into one buffer while simultaneously reading back results from another. Another possibility is to find a way to mix this with OpenGL and create a framebuffer system that lets a program draw directly to the screen via CPU memory writes (like old DOS programs could).
If you launch the kernel with enough threads to cover any expected input size, you could emulate kernel launches of different sizes from the CPU by passing the kernel function parameters and thread count via the ring buffer. The kernel would simply call the specified function with those parameters and use if statements to mask off surplus threads. Is it possible to use function pointers in CUDA to select the function from the CPU, or would a function lookup table be needed?
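My current thinking is that an index into a device-side table is the simpler scheme, since the CPU cannot portably take the address of a __device__ function, while device function pointers themselves do work on Fermi (sm_20+). A sketch, with all names invented:

```cuda
typedef void (*cmd_fn)(unsigned char *args, unsigned int nthreads);

// An example "launchable" function: doubles nthreads floats.
__device__ void do_scale(unsigned char *args, unsigned int nthreads)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nthreads) {          // mask off surplus threads to
        float *data = (float *)args;   // emulate a smaller launch
        data[tid] *= 2.0f;
    }
}

// Device-side lookup table; the CPU sends only the index.
__device__ cmd_fn g_cmd_table[] = { do_scale /* , ... */ };

__device__ void dispatch(unsigned int fn_index, unsigned char *args,
                         unsigned int nthreads)
{
    g_cmd_table[fn_index](args, nthreads);
}
```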
Anyway, I can’t try this out at the moment, but I will next week. I imagine much of this is, strictly speaking, unsupported or implementation-specific behaviour, but I am not worried about that as it is all for personal research.
Has anyone done something like this before? Also do the NVIDIA people have anything to say?
I hope the new Kepler chips will support an IOMMU and native sharing of CPU and GPU virtual memory pages; AMD have announced that their 7000-series chips will support this. That would allow the GPU to directly access CPU and process memory, enabling truly heterogeneous computing and even more advanced techniques.