Any performance data on using only the parallelizable parts of JPEG compression?

We have to significantly compress a ton of (monochrome) raw image data. If one were to use just the parallelizable stages of JPEG compression (the DCT and run-length encoding of the quantized results) and run them on a GPU so that each 8x8 block is compressed in parallel, I am hoping that would be very fast and still yield a compression factor close to what full JPEG gives. In other words, I am hoping we could compress a very large number of 8x8 blocks in parallel and still get an excellent compression factor. I guess there is branching in run-length encoding, so possibly the CPU should handle that stage(?)
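To make the idea concrete, here is a rough CPU-side sketch in Python of the per-block stages I mean. It is only an illustration under assumptions, not libjpeg's actual code: the flat quantizer step `Q_STEP`, the `"EOB"` marker, and the function names are all made up for this example, and real JPEG uses an 8x8 quantization table and follows the run-length pairs with Huffman coding, which is skipped here.

```python
import math

N = 8        # JPEG block size
Q_STEP = 16  # hypothetical flat quantizer step; real JPEG uses an 8x8 table

# 1-D DCT-II basis: C[u][x] = a(u) * cos((2x + 1) * u * pi / 16)
C = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]

# Zigzag scan order: walk the anti-diagonals, alternating direction.
ZIGZAG = sorted(
    ((i, j) for i in range(N) for j in range(N)),
    key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
)

def dct2(block):
    """Separable 2-D DCT of one 8x8 block: rows first, then columns."""
    tmp = [[sum(block[y][x] * C[u][x] for x in range(N)) for u in range(N)]
           for y in range(N)]
    return [[sum(C[v][y] * tmp[y][u] for y in range(N)) for u in range(N)]
            for v in range(N)]

def compress_block(block):
    """DCT, quantize, then run-length encode one block of 8-bit pixels.

    Returns (zero_run, value) pairs in zigzag order, ending with "EOB".
    """
    coeffs = dct2([[p - 128 for p in row] for row in block])  # level shift
    quant = [[round(c / Q_STEP) for c in row] for row in coeffs]
    pairs, run = [], 0
    for i, j in ZIGZAG:
        v = quant[i][j]
        if v == 0:
            run += 1          # count zeros until the next nonzero value
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")       # trailing zeros are implied by the marker
    return pairs

def compress_blocks(blocks):
    # Each block is independent of the others; this loop is the part
    # I would hope to map onto the GPU, one block per group of threads.
    return [compress_block(b) for b in blocks]
```

For example, `compress_block` on a flat block where every pixel is 144 gives `[(0, 8), "EOB"]`: a single DC coefficient and nothing else. The output length varies per block, which is part of why I suspect the run-length stage is the awkward one to do on the GPU.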

Does anyone with more GPU / image compression experience have any idea how this would compare, in both compression ratio and speed, with using libjpeg on a CPU? Certainly it will compress less and run faster, but I have no idea how significant either difference would be. My knowledge of CUDA is still very, very limited (I once got something I wrote to compile, but never got it working, because I suddenly got a job :) ), so I have zero practical experience with it.

Yes, I know it will depend on the hardware. It would be running on a single-board computer with a PCIe slot, though even numbers for a standard PC with a specific NVIDIA card would be useful.