Offload RAID XOR to GPU? Can CUDA technologies be used for RAID effectively?

I’m developing a user-mode block driver for Windows that works similarly to BeyondRAID (I wrote a paper on the tech long before BeyondRAID came around). It combines RAID 0, RAID 5 and RAID 6 techniques to maximize hard drive reliability at the block level, instead of relying on technologies like RAID-Z, which require a full file system implementation.

The goal is to bring RAID within the reach of consumers for home media servers and the like. I also want to do this in software, since hard drive controllers can then be mixed and matched, as opposed to needing to purchase a high-port-count controller, which has become prohibitively expensive over time.

For a home media server to be useful these days, it needs to run on extremely inexpensive hardware. An ION-based NVIDIA motherboard would be an ideal system for a home RAID. The problem is that achieving even reasonable performance would need either an ASIC or a GPU-based solution. Implementing this with an ASIC is easy enough, and selling a board through a Chinese vendor like DealExtreme would be easy enough too. But in reality, it’s a far from perfect solution.

Therefore, the GPU is the way to go if it’s practical. I’ve done very little GPU computing so far. It has been limited to OpenGL-based fragment shaders and the like, and that of course has been floating-point based.

  1. Would it be practical and more importantly beneficial to code (in CUDA for example), an engine for offloading XOR block operations from the CPU effectively making the GPU a RAID coprocessor?

The CPU would handle all the block-based processing (things like “what do I store where”): create a job, asynchronously pull the blocks needed for the XOR from the hard drives, then push the data to the graphics card several blocks at a time. When the job is done, the GPU would signal completion to the CPU, which would then asynchronously write the data to the drive.
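For reference, the operation being offloaded is small. Below is a minimal CPU-side sketch in C of RAID 5 style XOR parity, assuming equal-sized blocks; the function name `xor_blocks` and its signature are hypothetical, made up for illustration and not taken from any driver. A GPU version would run the same per-byte XOR, just spread across many threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR `count` source blocks of `len` bytes each into `out`.
 * Hypothetical helper for illustration: `out` must not alias any source. */
void xor_blocks(uint8_t *out, const uint8_t *const *src,
                size_t count, size_t len)
{
    assert(count > 0);
    memcpy(out, src[0], len);            /* start from the first block */
    for (size_t b = 1; b < count; ++b)   /* fold in the remaining blocks */
        for (size_t i = 0; i < len; ++i)
            out[i] ^= src[b][i];
}
```

The same routine covers both directions: XOR all data blocks to get the parity block, or XOR the surviving data blocks together with the parity block to rebuild a single lost block.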

The performance of this would depend highly on whether the GPU is suitable for running large loops of integer operations on system memory. In the case of the ION-based Atom boards, I imagine that the system memory and graphics memory are the same memory. Also, because the graphics memory needs to be accessed in a clock-sensitive manner, the GPU is able to read it quite smoothly, “DMA style”.

  2. Can I pass a pointer to system memory to the GPU and operate directly on that?

  3. Are there “CPU/GPU” tools for synchronization? Can I signal the CPU when a job is finished?

Thanks in advance,

  • Darren

You are right that this is only practical for an Ion system, where the chipset can map system memory into the GPU address space with no copy required. On any other CUDA system where all GPU I/O has to happen over PCI-Express, I think the overhead of copying data from system memory to the GPU, then GPU results back to system memory and out again to the disk would kill this solution.

Specifically, I think you will need the first generation Ion, using the GeForce 9400M, since Ion 2 is designed for the current Atom chips, which pull the memory controller onto the CPU, negating the benefit of direct access to system memory from the GPU. (I think… please correct me if I have the architecture wrong…)

So, given an Ion system, you can share memory directly between the CPU (“host”) and the GPU (“device”):

  • Allocate memory on the host using cudaHostAlloc() and the cudaHostAllocMapped flag. cudaHostAlloc will give you a block of system memory that has been “page locked,” which means that the operating system is not allowed to move its physical location, which is important for DMA. The flag further indicates that you want to have this block mapped into the address space of the device.

  • Use the cudaHostGetDevicePointer() function to obtain the device address of your mapped block. This is the pointer you pass to your kernel for access to memory on the device.

Note that, in addition to Ion, you can also map memory on pretty much any of the GTX 200-series cards (and Fermi-based devices), although with discrete cards, the mapping means that device reads and writes are sent over the PCI-Express bus to the system memory, which is probably not good enough for this application.

As far as CPU-GPU synchronization, ultimately you are stuck with polling for kernel completion in your host code, rather than receiving some kind of signal. There is a way to perform a non-blocking poll, and with a blocking poll, you have the option (using cudaSetDeviceFlags()) to specify whether you want to spin or yield the CPU thread while waiting for the GPU to finish. The spin option has lower latency, but yield uses less CPU time.

Also, one last comment: CUDA is really only intended for use from user space, so as long as your plan is to manage the RAID from a daemon, rather than an OS kernel module, CUDA should be a viable option.

(I haven’t heard of anyone trying this idea out on the Atom, so definitely let us know how the experiment turns out!)

Paper on accelerating RAID with CUDA…wCurryPaper.pdf

So even with PCI-Express overhead, this still works pretty well. Good to know!

With due respect to the paper and their hard work,

Inexpensive $300 GPUs to generate parity… Sounds good… but those $300 GPUs themselves have no guarantee against bit errors (no ECC).
That looks to be the biggest loophole of that paper.

I have seen enough major RAID cards from major manufacturers crashing or failing, including data loss or corruption, usually due to their memory subsystems, that I am not sure that using a GPU is a problem!

Moreover, if you look at the GPGPU processing power needed to keep up with 1 GB/s of bandwidth (say, a big bunch of physical hard drives or a handful of SSDs), you’d probably be talking about a sub-$100 GPU card.

And finally, these GPUs will probably come with 512 MB to 1 GB of video memory (DDR2 or GDDR3), memory that could be used as cache for the RAID system itself, and the CUDA RAID software will have spare time to run a CRC over it to ensure accuracy and stability ^_^
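To make the CRC idea concrete, here is a minimal sketch in C of a bitwise CRC-32 (the common reflected polynomial 0xEDB88320) that could be stored alongside each cached block and rechecked before the block is used. The function name `crc32_ieee` is just an illustrative choice, and a real implementation would more likely be table-driven or run on the GPU itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
 * Slow but compact; hypothetical helper for illustration only. */
uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            /* if the low bit is set, shift and apply the polynomial */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

A cached block whose recomputed CRC no longer matches the stored value would be treated as corrupt and re-read from disk.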

And one could do more complicated forward error correction on the graphics chip. Simple parity methods are so 1960…

Reed-Solomon codes should work nicely, as they operate on octets (bytes).
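As a rough illustration of why byte-oriented codes fit here, below is a sketch in C of the two-syndrome (P and Q) scheme commonly used for RAID 6, which is a small Reed-Solomon code over GF(2^8). It assumes the field polynomial 0x11d with generator 2; all function names are hypothetical and the loops are deliberately naive rather than tuned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;   /* reduce by the field polynomial */
        b >>= 1;
    }
    return r;
}

uint8_t gf_pow(uint8_t a, unsigned n)
{
    uint8_t r = 1;
    while (n--) r = gf_mul(r, a);
    return r;
}

/* a^254 is the inverse of a, because a^255 == 1 for every nonzero a. */
uint8_t gf_inv(uint8_t a)
{
    assert(a != 0);
    return gf_pow(a, 254);
}

/* P = XOR of d_k, Q = XOR of 2^k * d_k, computed per byte offset. */
void pq_syndromes(uint8_t *const *data, size_t ndisks, size_t len,
                  uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < len; ++i) {
        uint8_t pv = 0, qv = 0;
        for (size_t d = ndisks; d-- > 0; ) {   /* Horner's rule for Q */
            pv ^= data[d][i];
            qv = (uint8_t)(gf_mul(qv, 2) ^ data[d][i]);
        }
        p[i] = pv;
        q[i] = qv;
    }
}

/* Rebuild data disks x and y (x < y) in place from P, Q and the survivors. */
void recover_two(uint8_t *const *data, size_t ndisks, size_t len,
                 size_t x, size_t y, const uint8_t *p, const uint8_t *q)
{
    assert(x < y && y < ndisks && ndisks <= 255);
    uint8_t gx = gf_pow(2, (unsigned)x), gy = gf_pow(2, (unsigned)y);
    uint8_t inv = gf_inv((uint8_t)(gx ^ gy));
    for (size_t i = 0; i < len; ++i) {
        uint8_t pd = p[i], qd = q[i], g = 1;
        for (size_t d = 0; d < ndisks; ++d) {  /* strip survivors' share */
            if (d != x && d != y) {
                pd ^= data[d][i];
                qd ^= gf_mul(g, data[d][i]);
            }
            g = gf_mul(g, 2);
        }
        /* now pd = dx ^ dy and qd = gx*dx ^ gy*dy; solve for dx, dy */
        uint8_t dx = gf_mul(inv, (uint8_t)(qd ^ gf_mul(gy, pd)));
        data[x][i] = dx;
        data[y][i] = (uint8_t)(pd ^ dx);
    }
}
```

Every byte offset is independent of the others, which is the property that would let a GPU assign one thread per byte (or per word) of the stripe.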