Has anyone tried out a CRC encoder or modulo-2 division on a GPU?
You might want to do a literature search on parallel algorithms for polynomial division, as both CRC check and the BCH encoder you asked about earlier are specific applications of this generic algorithm.
I don’t know your use case, but from the work with CRC I did decades ago, I have doubts whether there is enough inherent parallelism given typical bit-string lengths in real-life scenarios to warrant the use of a GPU.
Today, most CRC encoders use lookup tables and are very fast, at least for CRC-32 and CRC-16. You no longer have to perform the shift, XOR, and mask operations for every bit of every byte; instead, you look up a value in a table and XOR it with the running result, one byte at a time. I don’t think it would gain very much from parallel processing.
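To make the byte-wise LUT idea concrete, here is a minimal sketch in C using the reflected CRC-32 polynomial 0xEDB88320 (the IEEE 802.3 variant); the function and table names are my own, not from any particular library:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Build the 256-entry lookup table: each entry is the CRC of one
   byte value, computed bit by bit with the reflected polynomial. */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[i] = c;
    }
}

/* Byte-wise CRC-32: one table lookup and one XOR per input byte,
   replacing eight per-bit shift/XOR/mask steps. */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;  /* standard initial value */
    while (len--)
        c = crc_table[(c ^ *buf++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;    /* standard final XOR */
}
```

Note the loop-carried dependence on `c`: each byte’s lookup needs the previous result, which is exactly why a single CRC stream offers so little parallelism.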
As a historical reference, the granddaddy of LUT-based CRC computations is:
A. Perez, “Byte-Wise CRC Calculations,” IEEE Micro, vol. 3, no. 3, pp. 40-50, June 1983.
I used a refinement of this method back when I worked with CRC checksums in the late 1980s. When SIMD instructions first came around, adaptations of the LUT approach were published that achieved a decent speedup over scalar code. Around the turn of the century, more aggressive parallelization methods using Galois field multiply-accumulate were published.
We don’t know what the OP’s use case looks like. Maybe there is additional parallelism in terms of batching CRC computations.
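If batching is available, the parallelism is across messages rather than within one: each CRC is independent, so each batch element could map to its own GPU thread or CPU core. A hedged sketch in C (the batch layout with a pointer array and length array is my own assumption; a compact bitwise CRC-32 is used here just to keep the example self-contained):

```c
#include <stdint.h>
#include <stddef.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320), kept
   deliberately simple; a real implementation would use a LUT or
   carry-less multiply. */
static uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;
    while (len--) {
        c ^= *buf++;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
    }
    return c ^ 0xFFFFFFFFu;
}

/* Batched CRC: every iteration is independent of the others, so
   this loop is embarrassingly parallel -- on a GPU, one thread per
   message; on a CPU, one core per chunk of messages. */
static void crc32_batch(const uint8_t *const *msgs, const size_t *lens,
                        uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = crc32_bitwise(msgs[i], lens[i]);
}
```

Whether this wins on a GPU then depends on message sizes and transfer costs, which is why the OP’s actual use case matters.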