GPU for video encoding

I am interested in speeding up a video encoder (MPEG-2 or H.264) using the GPU. I think motion estimation is a good candidate for this, but I am stuck choosing a motion estimation algorithm that is both accurate and well suited to a fast GPU implementation. CUDA would be used for the implementation.

There is some literature on motion estimation on the GPU, most of it using gradient-based ME or block-matching ME (for block matching I only found full-search results; the faster algorithms don’t seem to map very well to the GPU architecture).
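For reference, here is a minimal CPU-side sketch of full-search block matching with a sum-of-absolute-differences (SAD) cost, to show why it is the variant that maps easily to a GPU: every candidate offset is evaluated independently of the others. The names (`MotionVector`, `blockSad`, `fullSearch`), the 8-bit single-plane frame layout, and the parameters are assumptions made for illustration; this is not taken from any particular paper or encoder.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical result type: best offset found and its SAD cost.
struct MotionVector {
    int dx;
    int dy;
    unsigned sad;
};

// SAD between the block at (bx, by) in the current frame and the block
// at (bx + dx, by + dy) in the reference frame. Frames are assumed to be
// 8-bit, single plane, row-major, with `stride` bytes per row.
static unsigned blockSad(const uint8_t* cur, const uint8_t* ref, int stride,
                         int bx, int by, int dx, int dy, int blockSize) {
    unsigned sad = 0;
    for (int y = 0; y < blockSize; ++y) {
        for (int x = 0; x < blockSize; ++x) {
            int c = cur[(by + y) * stride + (bx + x)];
            int r = ref[(by + dy + y) * stride + (bx + dx + x)];
            sad += static_cast<unsigned>(std::abs(c - r));
        }
    }
    return sad;
}

// Exhaustive search over all offsets in [-range, range] x [-range, range].
// Each (dx, dy) candidate is independent of the others, which is the
// property that would let one GPU thread (or thread group) handle one
// candidate, followed by a min-reduction over the SAD values.
MotionVector fullSearch(const std::vector<uint8_t>& cur,
                        const std::vector<uint8_t>& ref,
                        int width, int height,
                        int bx, int by, int blockSize, int range) {
    MotionVector best{0, 0, std::numeric_limits<unsigned>::max()};
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = bx + dx;
            int ry = by + dy;
            // Skip candidates that fall outside the reference frame.
            if (rx < 0 || ry < 0 ||
                rx + blockSize > width || ry + blockSize > height) {
                continue;
            }
            unsigned sad = blockSad(cur.data(), ref.data(), width,
                                    bx, by, dx, dy, blockSize);
            if (sad < best.sad) {
                best = MotionVector{dx, dy, sad};
            }
        }
    }
    return best;
}
```

Calling `fullSearch(cur, ref, width, height, bx, by, 8, 4)` would evaluate up to 81 candidates for one 8x8 block. The faster searches (such as 4SS) choose the next candidates based on the previous step's result, and that data dependency is what makes them harder to parallelize in the same way.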

I would appreciate some hints to help me choose such an algorithm. The result should be faster than a CPU implementation (2 or 4 cores, using SSE2, SSE3, or even SSE4 later this year).

I was considering full search, 4SS (four-step search), hierarchical search, phase correlation, gradient-based methods, and DCT-domain ME.

I also know that NVIDIA GPUs have a dedicated video processor. Is it usable by programmers (through CUDA, DXVA, etc.)?

Thanks for any info.