Parallel CRC Generator
![]()
Click here to download a full version of this article
Every modern communication protocol uses one or more error detection algorithms. Cyclic Redundancy Check, or CRC, is by far the most popular one. CRC properties are defined by the generator polynomial length and coefficients. The protocol specification usually defines CRC in hex or polynomial notation. For example, CRC5 used in USB 2.0 protocol is represented as 0×5 in hex notation or as G(x)=x5+x2+1 in the polynomial. This CRC is implemented in hardware as a shift register as shown in the following picture.
The problem is that in many cases shift register implementation is suboptimal. It only allows the calculation of one bit every clock. If a design has 32-bit wide datapath, meaning that every clock CRC module has to calculate CRC on 32-bit of data, this scheme will not work. Somehow this serial shift register implementation has to be converted into a parallel N-bit wide circuit, where N is the design datapath width, so that every clock N bits are processed.
I started researching the available literature on parallel CRC calculation methods and found only a handful of papers ([2], [3]) that deal with this issue. Most sources are academic and focus on the theoretical aspect of the problem. They are too impractical to implement in software or hardware for a quick code generation.
I came up with the following scheme that I’ve used to build an online Parallel CRC Generator tool. Here is a description of the steps in which I make use USB CRC5 mentioned above. 
(1) Let’s denote N=data width, M=CRC width. For example, if we want to generate parallel USB CRC5 for 4-bit datapath, N=4, M=5.
(2) Implement serial CRC generator routine using given polynomial or hex notation. It’s easy to do in any programming language or script: C, Java, Perl, Verilog, etc.
(3) Parallel CRC implementation is a function of N-bit data input as well as M-bit current state CRC, as shown in the above figure. We’re going to build two matrices: Mout (next state CRC) as a function of Min(current state CRC) when N=0 and Mout as a function of Nin when M=0.
(4) Using the routine from (2) calculate CRC for the N values when Min=0. Each value is one-hot encoded, that is there is only one bit set. For N=4 the values are 0×1, 0×2, 0×4, 0×8. Mout = F(Nin,Min=0)
(5) Build NxM matrix, Each row contains the results from (3) in increasing order. For example, 1’st row contains the result of input=0×1, 2′nd row is input=0×2, etc. The output is M-bit wide, which the desired CRC width. Here is the matrix for USB CRC5 with N=4.

(6) Each column in this matrix, and that’s the interesting part, represents an output bit Mout[i] as a function of Nin.
(7) Using the routine from (3) calculate CRC for the M values when Nin=0. Each value is one-hot encoded, that is there is only one bit set. For M=5 the values are 0×1, 0×2, 0×4, 0×8, 0×10. Mout = F(Nin=0,Min)
(8) Build MxM matrix, Each row contains the results from (7) in increasing order. Here is the matrix for USB CRC5 with N=4
(9) Now, build an equation for each Mout[i] bit: all Nin[j] and Min[k] bits in column [i] participate in the equation. The participating inputs are XORed together.
Mout[0] = Min[1]^Min[4]^Nin[0]^Nin[3]
Mout[1] = Min[2]^Nin[1]
Mout[2] = Min[1]^Min[3]^Min[4]^Nin[0]^Nin[2]^Nin[3]
Mout[3] = Min[2]^Min[4]^Nin[1]^Nin[3]
Mout[4] = Min[0]^Min[3]^Nin[2]
That is our parallel CRC.
I presume since the invention of the CRC algorithm more than 40 years ago, somebody has already came up with this approach. I just coulnd’t find it and “reinvented the wheel”.
Keep me posted if the CRC Generation tool works for you, or you need more clarifications on the algorithm.
![]()
Click here to download a full version of this article
References
- CRC on Wikipedia
- G. Campobello, G Patane, M Russo, “Parallel CRC Realization”
- W Lu, S. Wong, “A Fast CRC Update Implementation”







Hello! Recently,I’m studying the CRC,your work is very helpful to me,your explanation of the algorithm is also very clear.But I have a question about the tool “crc-gen.exe”,I’ve been taught that when the shift register is initialized for all-ones,Before the algorithm to deal with,the first n-bit message data bits must be reverse,but not the case in crc-gen.exe,I want to know Why.I hope that you have time to e-mail me.Thanks!
Thanks for the interest in my parallel CRC generator.
I think there is no connection between initializing the LFSR with ‘1’s and reversing the data bits. Non-zero initialization is intended for detecting leading ‘0′ bit sequences in the data.
Reversing data bits is a protocol-specific way to calculate CRC. I think it’d be incorrect to reverse data bits in the first n-bit message and not in subsequent ones.
Can you please point me to the source that tells doing that.
Thanks for your answer. I can understand that all-ones initialization is intended for detecting leading ‘0′ bit sequences in the data,but how to understand the result which is different from the remainder of direct long division operation. so I think this is the reason it reverse the first n-bit data bits.
the source is come from the chinese version of CRC on Wikipedia.
Thanks!
Hi,
There is a much simpler way to do this.
e.g. in VHDL, write a function to calculate 1 bit of the CRC. Then, inside a clocked process, run a for loop and call this function multiple times. e.g. if you want to generate 8 bits of CRC per clock, the function gets called 8 bits. This synthesises and meets timing just fine.
You will find actual code by Mike Treseler on comp.arch.fpga to do this. It works well.
Cheers
Andrew
Hi Andy,
You’re right, it can be done that way. The synthesis tool will then “unroll” the for loop and generate XOR tree. However, I found that directly coded XOR tree produces more compact logic, at least in case of Xilinx XST. More details are in this document.
Thanks,
Evgeni
Hi,
It is easy to understand and nicely written. Just have a basic doubt while going through the Matrix generation for parallel CRC. Logic for NxM metrix is clear. But bit confused on MxM matrix. It is of width of required CRC. While geneting MxM matrix do we need to consider the data width?(How it would reflect here) .
I tried to gererate logic for Parallel CRC with two combinations
1. N=4 and M=5
2. N=8 and M=5 Same equation just for a sek of understanding.
Here in both the cases MxM matix would be different though we are using same equation/polynomial. Can you please help to understand how to take care of it.
Hi Manesh,
MxM matrix is a function of the next CRC state given the current state, while data bits are zero. Still, it does depend on the data width N. That’s because serial CRC is calculated N times after the initialization.
Thanks,
Evgeni
Hi,
I saw your paper.It is quite useful.
I am working on the BCH code.In that I wanted to apply my information bits as a byte wise.Do you have any idea regarding that.Please reply me if you know anything about how to implement Bch code.
Thanks in advance
Krupesh
Hi Krupesh,
I’d expect the approach I describe also applies to the BCH. However, I don’t know BCH deep enough to be certain.
Thanks,
Evgeni
Thank you very much for replying me.
I have one more question.Suppose I have 16 bytes of data on which I want to apply CRC.I am using 8 bit parallel CRC then how to implement this.And Should I calculate my CRC polynomial according to 16 bytes of data or 1 byte of data.
Regards,
Krupesh
Hi,
One more thing I want to ask.I saw your code.I simulated it in modelsim.I gave four bits inputs and I checked the outputs.After that I created a serial model for your code.And I gave inputs same as before but serially.Now I am getting the outputs which is different from previous one.I think both results should same.May be I did some mistake.So would you please check this.
Regards,
Krupesh
Hi Krupesh,
You specify 8-bit data width and it’d take 16 clocks to calculate such a CRC, 1-byte at a time.
Thanks,
Evgeni
@krupesh
Yes, serial and parallel CRC results should be the same.
Thanks,
Evgeni
@Evgeni
I’m studying the CRC, your website is very helpful to me.
I have a few question about the CRC, Could you please give me some help?
——————–
Here is my problem.
1. lfsr(3:0)=1+x^3+x^4,
2. bitstream=”1011001″,
3. CRC=”1010″
Theoretically, if the input is “10110011010″, and after all these bits is shifted into CRC module, the CRC output will be zero. But, when I initiate the lfsr_q register with “1111″, the output will be not-zero; when I initiate the register with “0000″, the output will be all-zeros.
I am still confused about that all-one initiation, Could you please give me some help about it?
Another question is about the parallel CRC, if the datawidth is 8, data_in(0) is msb or lsb?
Thanks a lot!
Hi,
(1) The reason CRC LFSR is usually initialized with ‘1’s is to prevent the situation when all-zero data, or a long stream of leading zeros will cause the LFSR value not to change (we do want it to change for every data bit).
(2) The bit order is implementation-specific. But it’s not important in general implementation as soon as CRC generator on the tx side and CRC checker on the rx side follow the same order.
Hope that helps,
Evgeni
hi,
i am working on programmable crc i.e. so that i can change crc_key every time
in short i want to provide input data ,crc_key as input and want computed crc at output
i wrote a code but its sort of recursive involving XOR operation with key on data and shifting
do you have any idea how can i increase parallelism it further
Hi,
For small enough CRCs you can use a look-up table approach to make it programmable. But this will not scale well for larger CRCs.
Evgeni
but in look up table based aproach we need to create look up table at run time when ever key changes that consumes some time ,comparing that with the total computation time this aproach will deliver the same performance
Hi,
I was doing a project regarding this.Since we are taking data of width 8 every time,So 2^8=256 combinations.As for Crc generation we have to append n zeroes to the data width and divide it by divisor polynomial of n+1 bits.So once we calculate manually the 256 remainders and store in a look up table and everytime for certain data combination,remainder particular to that can be the output crc.Can this technique work faster?
Hi,
Which part of the technique do you want to work faster?
The one with storing all remainders in lookup table.