Parallel CRC Generator
![]()
Click here to download a full version of this article
Every modern communication protocol uses one or more error detection algorithms. Cyclic Redundancy Check, or CRC, is by far the most popular one. CRC properties are defined by the generator polynomial length and coefficients. The protocol specification usually defines CRC in hex or polynomial notation. For example, CRC5 used in USB 2.0 protocol is represented as 0×5 in hex notation or as G(x)=x5+x2+1 in the polynomial. This CRC is implemented in hardware as a shift register as shown in the following picture.
The problem is that in many cases shift register implementation is suboptimal. It only allows the calculation of one bit every clock. If a design has 32-bit wide datapath, meaning that every clock CRC module has to calculate CRC on 32-bit of data, this scheme will not work. Somehow this serial shift register implementation has to be converted into a parallel N-bit wide circuit, where N is the design datapath width, so that every clock N bits are processed.
I started researching the available literature on parallel CRC calculation methods and found only a handful of papers ([2], [3]) that deal with this issue. Most sources are academic and focus on the theoretical aspect of the problem. They are too impractical to implement in software or hardware for a quick code generation.
I came up with the following scheme that I’ve used to build an online Parallel CRC Generator tool. Here is a description of the steps in which I make use USB CRC5 mentioned above. 
(1) Let’s denote N=data width, M=CRC width. For example, if we want to generate parallel USB CRC5 for 4-bit datapath, N=4, M=5.
(2) Implement serial CRC generator routine using given polynomial or hex notation. It’s easy to do in any programming language or script: C, Java, Perl, Verilog, etc.
(3) Parallel CRC implementation is a function of N-bit data input as well as M-bit current state CRC, as shown in the above figure. We’re going to build two matrices: Mout (next state CRC) as a function of Min(current state CRC) when N=0 and Mout as a function of Nin when M=0.
(4) Using the routine from (2) calculate CRC for the N values when Min=0. Each value is one-hot encoded, that is there is only one bit set. For N=4 the values are 0×1, 0×2, 0×4, 0×8. Mout = F(Nin,Min=0)
(5) Build NxM matrix, Each row contains the results from (3) in increasing order. For example, 1’st row contains the result of input=0×1, 2′nd row is input=0×2, etc. The output is M-bit wide, which the desired CRC width. Here is the matrix for USB CRC5 with N=4.

(6) Each column in this matrix, and that’s the interesting part, represents an output bit Mout[i] as a function of Nin.
(7) Using the routine from (3) calculate CRC for the M values when Nin=0. Each value is one-hot encoded, that is there is only one bit set. For M=5 the values are 0×1, 0×2, 0×4, 0×8, 0×10. Mout = F(Nin=0,Min)
(8) Build MxM matrix, Each row contains the results from (7) in increasing order. Here is the matrix for USB CRC5 with N=4
(9) Now, build an equation for each Mout[i] bit: all Nin[j] and Min[k] bits in column [i] participate in the equation. The participating inputs are XORed together.
Mout[0] = Min[1]^Min[4]^Nin[0]^Nin[3]
Mout[1] = Min[2]^Nin[1]
Mout[2] = Min[1]^Min[3]^Min[4]^Nin[0]^Nin[2]^Nin[3]
Mout[3] = Min[2]^Min[4]^Nin[1]^Nin[3]
Mout[4] = Min[0]^Min[3]^Nin[2]
That is our parallel CRC.
I presume since the invention of the CRC algorithm more than 40 years ago, somebody has already came up with this approach. I just coulnd’t find it and “reinvented the wheel”.
Keep me posted if the CRC Generation tool works for you, or you need more clarifications on the algorithm.
![]()
Click here to download a full version of this article
References
- CRC on Wikipedia
- G. Campobello, G Patane, M Russo, “Parallel CRC Realization”
- W Lu, S. Wong, “A Fast CRC Update Implementation”







Hello! Recently,I’m studying the CRC,your work is very helpful to me,your explanation of the algorithm is also very clear.But I have a question about the tool “crc-gen.exe”,I’ve been taught that when the shift register is initialized for all-ones,Before the algorithm to deal with,the first n-bit message data bits must be reverse,but not the case in crc-gen.exe,I want to know Why.I hope that you have time to e-mail me.Thanks!
Thanks for the interest in my parallel CRC generator.
I think there is no connection between initializing the LFSR with ‘1’s and reversing the data bits. Non-zero initialization is intended for detecting leading ‘0′ bit sequences in the data.
Reversing data bits is a protocol-specific way to calculate CRC. I think it’d be incorrect to reverse data bits in the first n-bit message and not in subsequent ones.
Can you please point me to the source that tells doing that.
Thanks for your answer. I can understand that all-ones initialization is intended for detecting leading ‘0′ bit sequences in the data,but how to understand the result which is different from the remainder of direct long division operation. so I think this is the reason it reverse the first n-bit data bits.
the source is come from the chinese version of CRC on Wikipedia.
Thanks!
Hi,
There is a much simpler way to do this.
e.g. in VHDL, write a function to calculate 1 bit of the CRC. Then, inside a clocked process, run a for loop and call this function multiple times. e.g. if you want to generate 8 bits of CRC per clock, the function gets called 8 bits. This synthesises and meets timing just fine.
You will find actual code by Mike Treseler on comp.arch.fpga to do this. It works well.
Cheers
Andrew
Hi Andy,
You’re right, it can be done that way. The synthesis tool will then “unroll” the for loop and generate XOR tree. However, I found that directly coded XOR tree produces more compact logic, at least in case of Xilinx XST. More details are in this document.
Thanks,
Evgeni
Hi,
It is easy to understand and nicely written. Just have a basic doubt while going through the Matrix generation for parallel CRC. Logic for NxM metrix is clear. But bit confused on MxM matrix. It is of width of required CRC. While geneting MxM matrix do we need to consider the data width?(How it would reflect here) .
I tried to gererate logic for Parallel CRC with two combinations
1. N=4 and M=5
2. N=8 and M=5 Same equation just for a sek of understanding.
Here in both the cases MxM matix would be different though we are using same equation/polynomial. Can you please help to understand how to take care of it.
Hi Manesh,
MxM matrix is a function of the next CRC state given the current state, while data bits are zero. Still, it does depend on the data width N. That’s because serial CRC is calculated N times after the initialization.
Thanks,
Evgeni
Hi,
I saw your paper.It is quite useful.
I am working on the BCH code.In that I wanted to apply my information bits as a byte wise.Do you have any idea regarding that.Please reply me if you know anything about how to implement Bch code.
Thanks in advance
Krupesh
Hi Krupesh,
I’d expect the approach I describe also applies to the BCH. However, I don’t know BCH deep enough to be certain.
Thanks,
Evgeni
Thank you very much for replying me.
I have one more question.Suppose I have 16 bytes of data on which I want to apply CRC.I am using 8 bit parallel CRC then how to implement this.And Should I calculate my CRC polynomial according to 16 bytes of data or 1 byte of data.
Regards,
Krupesh
Hi,
One more thing I want to ask.I saw your code.I simulated it in modelsim.I gave four bits inputs and I checked the outputs.After that I created a serial model for your code.And I gave inputs same as before but serially.Now I am getting the outputs which is different from previous one.I think both results should same.May be I did some mistake.So would you please check this.
Regards,
Krupesh
Hi Krupesh,
You specify 8-bit data width and it’d take 16 clocks to calculate such a CRC, 1-byte at a time.
Thanks,
Evgeni
@krupesh
Yes, serial and parallel CRC results should be the same.
Thanks,
Evgeni
@Evgeni
I’m studying the CRC, your website is very helpful to me.
I have a few question about the CRC, Could you please give me some help?
——————–
Here is my problem.
1. lfsr(3:0)=1+x^3+x^4,
2. bitstream=”1011001″,
3. CRC=”1010″
Theoretically, if the input is “10110011010″, and after all these bits is shifted into CRC module, the CRC output will be zero. But, when I initiate the lfsr_q register with “1111″, the output will be not-zero; when I initiate the register with “0000″, the output will be all-zeros.
I am still confused about that all-one initiation, Could you please give me some help about it?
Another question is about the parallel CRC, if the datawidth is 8, data_in(0) is msb or lsb?
Thanks a lot!
Hi,
(1) The reason CRC LFSR is usually initialized with ‘1’s is to prevent the situation when all-zero data, or a long stream of leading zeros will cause the LFSR value not to change (we do want it to change for every data bit).
(2) The bit order is implementation-specific. But it’s not important in general implementation as soon as CRC generator on the tx side and CRC checker on the rx side follow the same order.
Hope that helps,
Evgeni
hi,
i am working on programmable crc i.e. so that i can change crc_key every time
in short i want to provide input data ,crc_key as input and want computed crc at output
i wrote a code but its sort of recursive involving XOR operation with key on data and shifting
do you have any idea how can i increase parallelism it further
Hi,
For small enough CRCs you can use a look-up table approach to make it programmable. But this will not scale well for larger CRCs.
Evgeni
but in look up table based aproach we need to create look up table at run time when ever key changes that consumes some time ,comparing that with the total computation time this aproach will deliver the same performance
Hi,
I was doing a project regarding this.Since we are taking data of width 8 every time,So 2^8=256 combinations.As for Crc generation we have to append n zeroes to the data width and divide it by divisor polynomial of n+1 bits.So once we calculate manually the 256 remainders and store in a look up table and everytime for certain data combination,remainder particular to that can be the output crc.Can this technique work faster?
Hi,
Which part of the technique do you want to work faster?
The one with storing all remainders in lookup table.
Hi, I’m not very at ease with english so please excuse my mistakes. I want to make a CRC code in VHDL following the CCITT protocol it means the polynum must be x^16+x^12+x^5+1 and i’m lost trying to apply your algo to this polynum. My entry data width is 24 bits. Could you please tell me the logic that changes regarding the CRC16 USB2.2 protocol?? I would appreciate if you could send me the code at my adress maxlamenace04[at]hotmail[dot]com for me to try to understand the changes.
Hi Max,
Just select 24 in the data width, 16 in the poly width, and check the correct polynomial values: 1,5,12.
Thanks,
Evgeni
thanks, it works i thought only some protocol were available. Thank you very much really, it’s a very convenient tool you made.
hey, i have a question regarding CRC32 calculation when input data is a 1 byte long. I am using your code in my project but i can’t seem to get the right crc i guess.
1- Do i need to reverse the bit order of each input byte or not ?
2- Do i need to complement the final crc value before transmitting ?
thanks
Hi Hassan,
There are different variants of CRC32, and each one has its implementation specifics. It’s usually defined in the protocol you use (Ethernet, PCI, etc.) whether to reverse the bits or complement the final CRC.
Thanks,
Evgeni
Thanks Evgeni, i am generating ethernet’s crc and i have read it on many sites that you have to reverse the bit order,on the contrary there is an equal number of sites saying that you dont need to reverse the bit order. But anyway, thanks a lot, your site is of great help to me =)
Hi, I have a question about the figure of LFSR in CRC HW. In your article, “A Practical Parallel CRC generation Method”, the LFSR for CRC5 implementaion is designed in a different way from the way Wikipedia has explained. The URL is as follows,
http://en.wikipedia.org/wiki/Computation_of_CRC#Parallel_computation_without_table.
Accoring to this website, data bit affects the lsb of crc value only. However, the LFSR in your article doesn’t. I’m very confused.. Which one is right?
Thanks,
Jeongsup Lee
Hi Jeongsup,
I’m not sure they’re different: in both the data bit is fed into an XOR gate. Perhaps you’re referring to the Galois vs Fibonacci implementation difference.
Thanks,
Evgeni
Hi Evgeni
I’m having some problems implementing the CRC32 with datawidth= 64 bits in a Spartan6.
My problem are the constraints not met. I’m trying to put it working at 160MHz but it don’t reach that.
Is it possible to pipeline a CRC? That would resolv my problem but I don’t know how to implement it.
Hi Chico,
I worked with CRC32 and datawidth=256 bit that run at 133MHz on Virtex4. I’d expect your configuration to meet timing. But before you resort to pipelining, I’d recommend trying to floorplan it.
Thanks,
Evgeni
Hi,
I am trying to understand CRC generation using the shift register, could someone help to explain how it works, thanks.
regards
Essennell
Hi,
I think a good place to start with understanding a 1-bit CRC is on wiki: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
Thanks,
Evgeni
Hi,
A quick question, I’ve just read your PDF “circuit-cellar-january-2010-crc.pdf”. And you’ve explained to me privously that you use Galois notation.
At the step where you generate the matrixes with your software LFSR implementation, which i guess is implemented with Galois notation. Is it possible to use a Fibonacci notation insted?
And will the reset of the proceedure still hold ?
Hi,
Galois and Fibonacci notations are used to describe linear feedback shift registers, which are used for CRC with 1-bit data.
Parallel CRC in my method is using XOR trees, which is a structure different from LFSR. It reduces to Galois-style LFSR for 1-bit data.
Thanks,
Evgeni
Ok,
Im not completly clear if that answer is a yes or no
. However, i know that the XOR trees are not a LFSR. My question, dont know if i have been unclear, is that the matrixes you use to calculate thoes XOR tress are based on a calculation made over a software implementation of a Galois LFSR. My question is will the same method work if the Matrix is made by a software implemented Fibonacci LFSR insted ?
Hi,
The method should work with either Fibonacci or Galois LFSRs. I expect that both will produce the same result – matrix coefficients, or XOR tree – but will need to work out a formal proof for that claim.
Thanks,
Evgeni
Hi,
I am working on CRC 32 bit project which has 120 bit or 15 byte data as serial in stream coming on an ethernet cable on the receiver side.i have the program and am not able to see the right crc in the simulation.i saw the program on your site but seems to long to get the crc32.the logic is as below:
ENTITY TL_CRC_32 IS
PORT (
clk : IN STD_LOGIC; — clock
reset : IN STD_LOGIC; — asynchronous reset
crc32_init : IN STD_LOGIC; — set CRC register to initial value
crc32_enable : IN STD_LOGIC; — enable CRC calculation
serial_IN : IN STD_LOGIC; — serial input data
crc32 : OUT bus32); — CRC32 accumolator
END ENTITY TL_CRC_32;
ARCHITECTURE behaviour OF TL_CRC_32 IS
SIGNAL crc32_shifter : STD_LOGIC_VECTOR (31 DOWNTO 0);
–SIGNAL temp : STD_LOGIC_VECTOR (31 DOWNTO 0);
SUBTYPE poly_vector IS BIT_VECTOR (31 DOWNTO 0);
CONSTANT poly : poly_vector := X”04C11DB7″; — polynominal
CONSTANT initval : STD_LOGIC := ‘1′; — initial crc reg. value
BEGIN
shift_data : PROCESS (clk, reset) IS
BEGIN
IF reset = ‘0′ THEN
crc32_shifter initval);
ELSIF clk’EVENT AND clk = ‘1′ THEN
IF crc32_init = ‘1′ THEN — load crc reg with init value
crc32_shifter initval);
ELSIF (crc32_enable = ‘1′) THEN — serial crc bit calculation
FOR i IN 0 TO 31 LOOP
IF (poly(i) = ‘1′) THEN
IF (i = 0) THEN
crc32_shifter(0) <= serial_IN XOR crc32_shifter(31);
ELSE
crc32_shifter(i) <= crc32_shifter(i-1) XOR serial_IN XOR crc32_shifter(31);
END IF;
ELSE
crc32_shifter(i) <= crc32_shifter(i-1);
END IF;
END LOOP;
END IF;
END IF;
END PROCESS shift_data;
crc32 <= crc32_shifter;
END ARCHITECTURE behaviour;
Can you please help me?
i am giving a date packet but not able to see the correct crc
Hi Vikram,
There is a long list of things that can go wrong during CRC calculation. I’d need more information to help you.
Thanks,
Evgeni