Parallel Scrambler Generator

  Scramblers are used in many communication protocols, such as PCI Express, SAS/SATA, USB, and Bluetooth, to randomize the transmitted data. To keep this post short and focused, I’ll not discuss the theory behind scramblers. For more information about scramblers see [1], or do some googling. The topic of this post is the parallel implementation of a scrambler generator. Protocol specifications define the scrambling algorithm using either hex or polynomial notation. This is not always suitable for efficient hardware or software implementation. Please read my post on the Parallel CRC Generator about that.

 The Parallel Scrambler Generator method that I’m going to describe has a lot in common with the Parallel CRC Generator. The difference is that the CRC generator outputs a CRC value, whereas the scrambler generator produces scrambled data. But the internal workings of both are based on the same principle.

Here is an example of a scrambler with the polynomial G(x) = x^16 + x^5 + x^4 + x^3 + 1



Following is the description of the Parallel Scrambler Generator algorithm:

(1) Let’s denote N=data width, M=generator polynomial width.

(2) Implement serial scrambler generator using given polynomial or hex notation. It’s easy to do in any programming language or script: C, Java, Perl, Verilog, etc. 
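
As a rough illustration of step (2), here is a minimal serial scrambler sketch in Python for the example polynomial. It assumes a PCI Express-style arrangement that the post does not spell out: a Galois-type LFSR, data processed LSB first, each scrambled bit formed by XORing the data bit with the LFSR MSB, and an LFSR that advances independently of the data. The names serial_step and scramble_word are made up for this sketch.

```python
# Serial (bit-at-a-time) scrambler sketch for G(x) = x^16 + x^5 + x^4 + x^3 + 1.
# Assumptions (not from the post): Galois-style LFSR, data processed LSB first,
# scrambled bit = data bit XOR LFSR MSB, LFSR state independent of the data.

M = 16                 # generator polynomial width
POLY_TAPS = 0x0039     # x^5 + x^4 + x^3 + 1 (the x^16 term is implicit)
MASK = (1 << M) - 1


def serial_step(lfsr, data_bit):
    """Advance the LFSR by one bit; return (next_lfsr, scrambled_bit)."""
    msb = (lfsr >> (M - 1)) & 1
    out_bit = data_bit ^ msb
    lfsr = (lfsr << 1) & MASK
    if msb:
        lfsr ^= POLY_TAPS
    return lfsr, out_bit


def scramble_word(lfsr, data, n_bits):
    """Scramble an n_bits-wide word by calling serial_step n_bits times."""
    out = 0
    for i in range(n_bits):
        lfsr, bit = serial_step(lfsr, (data >> i) & 1)
        out |= bit << i
    return lfsr, out


if __name__ == "__main__":
    state = 0xFFFF                      # all-ones seed
    for _ in range(4):
        state, byte = scramble_word(state, 0x00, 8)
        print(f"{byte:02X}")
```

With an all-ones seed and all-zero input bytes this prints FF, 17, C0, 14, which is the same sequence a commenter below reports for the PCI Express scrambler. Other protocols may need a different bit order or LFSR structure.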

(3) Parallel Scrambler implementation is a function of N-bit data input as well as M-bit current state of the polynomial, as shown in the above figure. We’re going to build three matrices:

  • Mout (next state polynomial) as a function of Min (current state polynomial) when Nin=0
  • Nout as a function of Nin when Min=0
  • Nout as a function of Min when Nin=0

      Note that the polynomial next state doesn’t depend on the scrambled data, therefore we need only three matrices (a fourth matrix, Mout as a function of Nin, would be all zeros).


(4) Using the routine from (2), calculate the Mout values given Min, when Nin=0. Each Min value is one-hot encoded, that is, only one bit is set.

(5) Build the MxM matrix. Each row contains the results from (4) in increasing order. For example, the 1st row contains the result of input=0x1, the 2nd row is input=0x2, etc. The output is M bits wide, which is the polynomial width.

(6) Calculate the Nout values given Nin, when Min=0. Each Nin value is one-hot encoded, that is, only one bit is set.

(7) Build the NxN matrix. Each row contains the results from (6) in increasing order. The output is N bits wide, which is the data width.

(8) Calculate the Nout values given Min, when Nin=0. Each Min value is one-hot encoded, that is, only one bit is set.

(9) Build the MxN matrix. Each row contains the results from (8) in increasing order. The output is N bits wide, which is the data width.
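
Steps (4) through (9) can be sketched roughly as follows for N=8, M=16. This is only an illustration under assumed conventions (Galois LFSR, LSB-first data, scrambled bit = data bit XOR LFSR MSB); serial_scramble and build_matrices are hypothetical names, and each matrix row is stored as an integer whose bit i is the entry in column i.

```python
# One-hot construction of the three matrices from steps (4)-(9).
# Assumed conventions (not stated in the post): Galois LFSR, LSB-first data,
# scrambled bit = data bit XOR LFSR MSB. Each matrix row is an integer whose
# bit i is the entry in column i.

M, N = 16, 8            # polynomial width, data width
TAPS = 0x0039           # x^5 + x^4 + x^3 + 1 (x^16 implicit)
MASK = (1 << M) - 1


def serial_scramble(m_in, n_in):
    """Serial reference model: N one-bit steps, returns (m_out, n_out)."""
    lfsr, n_out = m_in, 0
    for i in range(N):
        msb = (lfsr >> (M - 1)) & 1
        n_out |= (((n_in >> i) & 1) ^ msb) << i
        lfsr = (lfsr << 1) & MASK
        if msb:
            lfsr ^= TAPS
    return lfsr, n_out


def build_matrices():
    """Return the (MxM, NxN, MxN) matrices as lists of row integers."""
    # Row k is the response to the one-hot input that has only bit k set.
    mout_from_min = [serial_scramble(1 << k, 0)[0] for k in range(M)]
    nout_from_nin = [serial_scramble(0, 1 << j)[1] for j in range(N)]
    nout_from_min = [serial_scramble(1 << k, 0)[1] for k in range(M)]
    return mout_from_min, nout_from_nin, nout_from_min


if __name__ == "__main__":
    mxm, nxn, mxn = build_matrices()
    for name, rows, width in (("MxM", mxm, M), ("NxN", nxn, N), ("MxN", mxn, N)):
        print(name)
        for row in rows:
            print(" ", format(row, f"0{width}b"))
```

Under these assumptions the NxN matrix comes out as the identity, because each data bit only passes through a single XOR on its way to the output.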

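(10) Now, build an equation for each Nout[i] bit: every Nin[j] and Min[k] whose row has a bit set in column [i] of the NxN and MxN matrices participates in the equation. The participating inputs are XORed together. The Mout[i] (next state) equations are built the same way from column [i] of the MxM matrix.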


Nout is the parallel scrambled data.
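
To make step (10) concrete, the sketch below derives the XOR equations from the matrix columns and checks the parallel result against the serial model. It is not the code behind the online tool; it assumes N=8, M=16, a Galois LFSR, LSB-first data, and scrambled bit = data bit XOR LFSR MSB. All function and signal names are illustrative.

```python
import random

M, N = 16, 8            # polynomial width, data width
TAPS = 0x0039           # x^5 + x^4 + x^3 + 1 (x^16 implicit)
MASK = (1 << M) - 1


def serial_scramble(m_in, n_in):
    """Serial reference model (assumed conventions): returns (m_out, n_out)."""
    lfsr, n_out = m_in, 0
    for i in range(N):
        msb = (lfsr >> (M - 1)) & 1
        n_out |= (((n_in >> i) & 1) ^ msb) << i
        lfsr = (lfsr << 1) & MASK
        if msb:
            lfsr ^= TAPS
    return lfsr, n_out


# Row k of each matrix is the response to the one-hot input with bit k set.
MXM = [serial_scramble(1 << k, 0)[0] for k in range(M)]
NXN = [serial_scramble(0, 1 << j)[1] for j in range(N)]
MXN = [serial_scramble(1 << k, 0)[1] for k in range(M)]


def terms(rows, col, name):
    """Names of the inputs whose matrix row has a 1 in the given column."""
    return [f"{name}[{k}]" for k, row in enumerate(rows) if (row >> col) & 1]


def equations():
    """Return the XOR equations for every Nout and Mout bit as strings."""
    eqs = []
    for i in range(N):
        rhs = terms(NXN, i, "Nin") + terms(MXN, i, "Min")
        eqs.append(f"Nout[{i}] = " + " ^ ".join(rhs))
    for i in range(M):
        eqs.append(f"Mout[{i}] = " + " ^ ".join(terms(MXM, i, "Min")))
    return eqs


def parallel_scramble(m_in, n_in):
    """Evaluate all equations at once by XORing the rows of the set input bits."""
    m_out = n_out = 0
    for k in range(M):
        if (m_in >> k) & 1:
            m_out ^= MXM[k]
            n_out ^= MXN[k]
    for j in range(N):
        if (n_in >> j) & 1:
            n_out ^= NXN[j]
    return m_out, n_out


if __name__ == "__main__":
    print("\n".join(equations()))
    for _ in range(1000):
        m_in, n_in = random.getrandbits(M), random.getrandbits(N)
        assert parallel_scramble(m_in, n_in) == serial_scramble(m_in, n_in)
    print("parallel and serial models agree")
```

The first printed line is Nout[0] = Nin[0] ^ Min[15]. Comparing the parallel model against the serial one on random inputs is a cheap sanity check before translating the equations to Verilog or VHDL.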


Keep me posted if the Parallel Scrambler Generation tool works for you, or you need more clarifications on the algorithm.


  [1] Scramblers on Wiki

  1. Dan
    November 22nd, 2011 at 14:38 | #1


    Great site! Unfortunately I am another person in need of a slightly different implementation. However, I don’t believe it is related to self-synchronizing vs frame-synchronous, as I have applications for both and the implementations are different from yours. I noticed that my generator polynomials are listed in reverse order to yours and that might indicate the difference. For example:

    My first polynomial is G(x) = 1 + x^9 + x^11 (OTN PN-11). The implementation does not have the XORs in between the shift registers, it only uses them in the feedback loop. For this example the D0 reg is fed by the XOR of D8 and D10. The data between the regs is a straight forward shift. BTW it is used self-synchronous.

    My second polynomial is G(x) = 1 + x + x^3 + x^12 + x^16 (OTN scrambler). That would also be a simple shift register with D0 fed by the XOR of D0, D2, D11 and D15. This one is used Frame synchronous.

    So I believe at least in my case the change I need in your generator is where the XORs are placed. In fact, my required implementation might actually be simpler.


  2. dharani
    December 5th, 2011 at 04:15 | #2

    My polynomial equation is 1+ x^3+ x^4 + x^5 + x^15
    input data is 8 bit and after scrambling my output data should be an 8 bit.
    Can u give a verilog code regarding this information.

  3. December 6th, 2011 at 10:49 | #3

    Thanks for writing this article!

    I am interested in modeling a PCI Express Scrambler (N=8, M=16), but I’m having trouble understanding how you compute the output in the final step. I’ve read the CRC article from circuit cellar and I understand the CRC5 example.

    However, for the PCIe scrambler, I don’t quite follow. I have all three of my matrices, but I don’t understand how you are combining the outputs.

    In the CRC paper, you had a 4×5 and a 5×5 matrix (column sizes are equal) and I see how these are combined to create the output equations. However, for a scrambler with N=8, M=16, you end up with 16×16, 8×8, and 16×8 matrices. The first matrix has 8 extra columns, do I just ignore these columns in the output equation?

  4. December 6th, 2011 at 11:32 | #4

    I think I have figured it out. For the PCIe scrambler, it would seem that we do not need to compute the NxN and MxN matrices as the input has no effect on the LFSR state.

    As such, we can just compute the MxM matrix to get the equations for next state of the LFSR. I looked at the Verilog output of your generate tool, and I have been able to reproduce the LFSR next state equations based on my own simulation.

  5. Babar
    December 23rd, 2011 at 06:15 | #5

    Very good post my friend. I am little bit in SATA and PCIE. I want to do the parallel scrambling. my G(x)=X^16+X^15+X^13+X^4+1 and my input 32 bit. I tried to learn an algorithm on SATA tutorial and for my problem.

    [Diagram: a 16-bit register (Reg) feeds two matrix-multiply blocks; one drives output (31 downto 16), the other drives output (15 downto 0).]
    where M1 and M2 are 16x16 multiplication matrices.
    I could not understand how to construct M1 and M2

  6. December 23rd, 2011 at 13:28 | #6

    Hi Babar,

    I’m looking into SATA spec, section A.2.3 (example scrambler implementation). There is a C-code example on that page that can be directly converted to parallel scrambler implemented in Verilog/VHDL.

    next[31:16] equations are functions of now[15:0], and represent M1 matrix.
    You can look this way: next[31:16] = M1*now[15:0]. Each M1 element is either 1 or 0.

    next[15:0] equations are functions of now[15:0], and represent M2 matrix.


  7. Babar
    January 5th, 2012 at 07:19 | #7

    Hi Evgeni,
    Thank you very much for reply.
    Actually I have the same document you have. My problem is M1 and M2. I am trying to understand how he constructed M1 and M2.
    If you have any idea how he constructed M1 and M2 please do reply.
    Thanks in advance.


  8. jaya S
    January 10th, 2012 at 01:33 | #8

    I have generated VHDL code for the PCIe scrambler. It generates the correct result for the 8-bit data implementation, but it does not work for the 16- and 32-bit data cases (the results are different).

  9. January 10th, 2012 at 05:27 | #9


    It suggests there is something wrong with the byte order. Try scrambler with 16-bit data. Then feed the same data into an 8-bit scrambler. You should see the same result every other clock in the 8-bit one.
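
A quick way to try this check in software is sketched below. It assumes the PCI Express polynomial x^16 + x^5 + x^4 + x^3 + 1, a Galois LFSR, and LSB-first bit order, so the low byte of a 16-bit word goes through the 8-bit scrambler first. The function names are hypothetical.

```python
M = 16
TAPS = 0x0039           # x^5 + x^4 + x^3 + 1 (x^16 implicit)
MASK = (1 << M) - 1


def scramble(lfsr, data, n_bits):
    """Scramble n_bits of data, LSB first; returns (next_lfsr, scrambled)."""
    out = 0
    for i in range(n_bits):
        msb = (lfsr >> (M - 1)) & 1
        out |= (((data >> i) & 1) ^ msb) << i
        lfsr = (lfsr << 1) & MASK
        if msb:
            lfsr ^= TAPS
    return lfsr, out


def matches_two_byte_steps(word, seed=0xFFFF):
    """True if one 16-bit step equals two 8-bit steps (low byte first)."""
    lfsr16, out16 = scramble(seed, word, 16)
    lfsr8, lo = scramble(seed, word & 0xFF, 8)
    lfsr8, hi = scramble(lfsr8, word >> 8, 8)
    return out16 == ((hi << 8) | lo) and lfsr16 == lfsr8


if __name__ == "__main__":
    print(matches_two_byte_steps(0x1234))
```

If the bytes are fed to the 8-bit scrambler in the opposite order, the comparison fails, which is the kind of byte-order mismatch described above.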


  10. Babar
    January 17th, 2012 at 05:57 | #10

    Hi Everyone,
    I am a little bit confused about the CRC and scrambling sequence used in SATA because of my ignorance.
    My first confusion is that
    for SATA we have data coming like this: SOF-FIS-CRC-EOF. The size of the CRC is bounded at 4 bytes. If I calculate the CRC of every DWORD coming from a FIS of size, i.e., 1024 DWORDs, where will a CRC of this size reside (what does the CRC of this many DWORDs look like)?
    My second confusion is
    for the transmitter case I will calculate the CRC of the incoming DWORD and the same DWORD will go to the scrambler. What about the CRC DWORD? Do we need to scramble it too?
    Please if someone can clear these confusions I will be very thankful to you.
    Thanks in advance


  11. January 17th, 2012 at 18:57 | #11

    Hi Babar,

    CRC doesn’t grow in size with the new data. If you have 32-bit CRC, it remains the same size for every new data chunk. You just use an old CRC and data to calculate the new CRC.
    As far as scrambling CRC DWORD, the SATA specification should answer it (it has to be somewhere in the spec).
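
The fixed-size running CRC can be illustrated with Python's standard zlib.crc32, which accepts the previous CRC as its second argument. Note the assumption: zlib's CRC-32 parameters are not claimed to match the SATA CRC, so this only shows the principle. The helper name running_crc is made up.

```python
import zlib


def running_crc(chunks):
    """Fold each new chunk into the previous 32-bit CRC value."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # old CRC + new data -> new CRC
    return crc


if __name__ == "__main__":
    chunks = [bytes([i] * 4) for i in range(1024)]   # 1024 DWORD-sized chunks
    crc = running_crc(chunks)
    print(f"{crc:08X}", crc.bit_length() <= 32)
```

However many chunks are processed, the result stays a single 32-bit value.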


  12. Babar
    January 18th, 2012 at 01:17 | #12

    Thanks a lot for your very informative and urgent reply.



  13. Babar
    January 25th, 2012 at 06:35 | #13

    Hi Evgeni;

    I am generating VHDL code for parallel scrambler from this website for SATA. But I have a confusion that the C-code given on SerialATA_Revision_3_1 document is using the input for scrambler but for scrambling. Do you have any idea about that.
    Please just share your thoughts.

    Thanks in advance.

  14. January 25th, 2012 at 08:55 | #14

    Hi Babar,

    I didn’t understand your question. Can you please be more specific ?


  15. Babar
    January 25th, 2012 at 09:28 | #15

    What I know about scrambling is that we need to XOR the scrambler output with the input data. This I learned from the SATA tutorial. In the SATA tutorial, C code for the scrambler is given in the appendix (as you know). This C code does not use any input for the scrambler; it just takes the initial value 0xF0F6 and generates a continuous output, whose first output must be 0xC2D2768D. I verified this code. So after the scrambler we just XOR the data (that we want to scramble) with the scrambler output.

    Is the VHDL code that I generated from outputlogic only the scrambler, or does it provide the scrambled data, which I do not need to XOR with the input? If it is producing scrambled data, then the first value is wrong.

    I hope you understand

    Waiting for your feedback.




  16. January 25th, 2012 at 09:51 | #16

    Hi Babar,

    Right, scrambling used in SATA is a bit different than the code generated on this site. In general, there are several ways to do scrambling, and each protocol (e.g. SATA, PCI Express, USB, etc.) does it in a bit different way for various reasons. This has been discussed to some extent in the previous comments on this forum.


  17. Babar
    January 26th, 2012 at 04:28 | #17

    Yes you are right. Now I used the c-code for VHDL implementation and it works perfect.
    But I really appreciate your efforts to help people and I will definitely pay a salute to you for such a great work.

    Stockholm (Sweden)

  18. phani
    February 3rd, 2012 at 09:13 | #18

    16-bit parallel adder using registers

  19. Lawrence Jair
    February 21st, 2012 at 17:36 | #19

    I found the scrambler tool created code did not match Interlaken spec. Any idea?


  20. February 21st, 2012 at 17:58 | #20

    Hi Lawrence,

    I don’t have access to Interlaken spec. Can you please email me the pages dealing with Interlaken scrambling. I’ll take a look.


  21. April 9th, 2012 at 22:46 | #21

    Hi, I am doing a scrambler project that scrambles serial input data. In total I have 128 bits of input data and the polynomial is x^7+x^4+1. Please can you help me write a testbench in SystemVerilog?

  22. majdi
    May 8th, 2012 at 03:24 | #22

    Hi, I am working on a project implementing a scrambler and descrambler in VHDL, and I am missing the descrambler code. Please reply or point me to a specific website. Thank you.

  23. May 8th, 2012 at 05:49 | #23


    In most cases, descrambler has the same implementation as scrambler.


  24. Silas
    June 7th, 2012 at 18:19 | #24


    I’m trying to use your webtool to generate 1+x^43 scrambler. But I’m not sure if your webtool output correct verilog code. e.g. data width=64,

    data_c[0] = data_in[0] ^ lfsr_q[42];
    data_c[1] = data_in[1] ^ lfsr_q[41];

    data_c[41] = data_in[41] ^ lfsr_q[1];
    data_c[42] = data_in[42] ^ lfsr_q[0];
    data_c[43] = data_in[43] ^ lfsr_q[42]; **** shouldn’t this be “data_in[43] ^ data_c[0]” instead of “lfsr_q[42]”?
    data_c[44] = data_in[44] ^ lfsr_q[41]; **** shouldn’t this be “data_in[44] ^ data_c[1]” instead of “lfsr_q[41]”?

    After bit 42, the data_in[] should be XOR with data_c[], right? I’m confused. Please help.


  25. June 11th, 2012 at 08:41 | #25

    Hi Silas,

    Do you have another reference implementation that has a different code ?
    Also, one quick way to verify your concerns is to generate a serial scrambler and compare with the 64-bit one. Both should output the same scrambled data, of course.
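
For comparison, here is a small serial sketch of the two scrambler styles for 1 + x^43. It is an illustration only: it assumes the additive form keeps its state independent of the data, while the self-synchronous form feeds the scrambled output back into the shift register. The function names and the list-based state are invented for this example, with seed[42] playing the role of lfsr_q[42].

```python
import random

DELAY = 43  # G(x) = 1 + x^43


def additive_scramble(bits, seed):
    """Additive form: the keystream is a free-running 43-bit register."""
    state, out = list(seed), []
    for b in bits:
        key = state[-1]                 # oldest register bit
        out.append(b ^ key)
        state = [key] + state[:-1]      # 1 + x^43 simply recirculates
    return out


def self_sync_scramble(bits, seed):
    """Self-synchronous form: out[n] = in[n] ^ out[n-43]."""
    state, out = list(seed), []
    for b in bits:
        s = b ^ state[-1]
        out.append(s)
        state = [s] + state[:-1]        # register holds scrambled bits
    return out


def self_sync_descramble(bits, seed):
    """Inverse of self_sync_scramble: in[n] = rx[n] ^ rx[n-43]."""
    state, out = list(seed), []
    for r in bits:
        out.append(r ^ state[-1])
        state = [r] + state[:-1]        # register holds received bits
    return out


if __name__ == "__main__":
    data = [random.getrandbits(1) for _ in range(64)]
    seed = [random.getrandbits(1) for _ in range(DELAY)]
    add = additive_scramble(data, seed)
    ss = self_sync_scramble(data, seed)
    print("first 43 bits equal:", add[:DELAY] == ss[:DELAY])
    print("additive  bit 43 uses seed[42]:", add[43] == data[43] ^ seed[42])
    print("self-sync bit 43 uses out[0]:  ", ss[43] == data[43] ^ ss[0])
```

In the additive form bit 43 is XORed with the register bit again, while in the self-synchronous form it is XORed with scrambled bit 0. The two outputs agree for the first 43 bits and can differ afterwards, so it matters which form the target protocol specifies.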


  26. Rajib
    July 10th, 2012 at 23:11 | #26

    Can we just make a worked mathematical example of an 8-bit data scrambler and descrambler? I am trying to find an example, but Google turned up thousands of variants, which makes me more confused. I would highly appreciate it if anyone could give a simple example of scrambling and descrambling 8-bit data.

  27. Kevin
    June 26th, 2013 at 09:33 | #27

    Hi Evgeni, your website is very good. I am trying to incorporate your method into my scrambler design for IEEE802.3 Ethernet PCS. I wonder what the maximum clock frequency is that the circuit can support, since I am working on a 40G Ethernet project.

  28. June 26th, 2013 at 09:51 | #28

    Hi Kevin,

    Max frequency very much depends on the data width. I cannot even give a ballpark number; you’d need to generate the code and run it through synthesis and place-and-route tools to get an idea of the performance.


  29. Kevin
    June 26th, 2013 at 13:14 | #29

    Thank you for your reply. I am using a 64-bit wide input, but may expand to 256-bit in the future. The clock is now my main concern. I also have another question: the other designs I found in Xilinx IP cores (typically Xilinx xapp775) use another, more direct way to implement the scrambler, that is, N scrambler registers for an N-order polynomial. I wonder whether the actual function of your method differs from that of those validated IP cores. It seems the method you proposed optimizes area cost, which is what I want to make use of, because we have more important uses for the resources on our Altera Stratix IV. Thanks. @Evgeni

  30. June 26th, 2013 at 13:43 | #30

    Hi Kevin,

    Even for 256-bit input data the scrambler should have no problem running at 200MHz on Stratix-IV. I assume that your 40G datapath is running at about 125-150MHz.

    My method is to generate simple XOR trees. I found it yields the best results in terms of FPGA area.


  31. Kevin
    June 26th, 2013 at 15:50 | #31

    Thank you for your reply, I will compare and verify these 2 methods. I also want to explore a full cycle of ASIC design using a MOSIS project, and your website gives me lots of help if I want to design some simple course-practice IP cores. Thanks!!! @Evgeni

  32. Kevin
    June 27th, 2013 at 06:46 | #32

    Hi Evgeni, I tried both your code and those from Xilinx & Altera, it looks like yours is not working quite well since I quite understand those from Altera and Xilinx since they are following the IEEE802.3 standard for PCS scrambler. Can you explain why? The reference I have is Xilinx xapp775 and Altera Advanced Synthesis Cookbook @Evgeni

  33. June 27th, 2013 at 10:04 | #33

    Hi Kevin,

    Most common reason for the mismatch with the expected results are bit and byte ordering. Please take a look at previous comments of this thread.


  34. Kevin
    June 27th, 2013 at 12:30 | #34

    I went over all the earlier comments; they mention frame-sync and self-sync scramblers. I have no idea about the mathematical theory behind that, but all the reference designs I found, such as those from Xilinx and Altera, follow the self-sync scrambler. The scramblers in IEEE802.3 and other standards for PCIe and Interlaken are all the same, I suppose. So can you point out which industry standards your method is compatible with? @Evgeni

  35. June 27th, 2013 at 16:11 | #35

    I used it in cores for PCIe, USB, SAS/SATA, and several wireless protocols. I haven’t personally used it in IEEE802.3 and Interlaken.

  36. Kevin
    June 28th, 2013 at 09:49 | #36

    To be safe, I intended use those code from Altera and Xilinx, but after this project, I may be able read those papers and figure out something to help you improve, thanks.@Evgeni

  37. Rubikian
    February 8th, 2014 at 22:25 | #37

    Hi Evgeni,

    I have a question. When I was trying to come up with the descrambler code, I thought I might need an initial seed for my polynomial. For example, you are setting {16{1'b1}} as your polynomial’s initial value.

    But the question is we can not predict what will be the input data from the upstream. So how should I set the initial seed for my polynomial.

    I did write the PRBS checker code before. I think that is easier, because i just need to buffer my previous incoming parallel data and XOR with the algorithm. The result of this XOR operation will match with the next incoming parallel data. But seem like scrambler code is a bit different compare than the PRBS checker…

    Would you mind to provide some advice on this?


  38. Rubikian
    February 9th, 2014 at 06:10 | #38

    I looked into PCI EXPRESS BASE SPEC document and found that the scramble and the descramble is exactly using the same algorithm. But when I try to mimic this in my logic and run it in the simulation, I’m just cant get back the original data from the descramble output…

    Really frustrating with the logic behind :(

    fixed data 8'h00 go in logic …scramble output = FF, 17, C0, 14 and so on
    but this set of the data go in the descramble logic, why i cant get back 8'h00 on the output side?

  39. February 9th, 2014 at 17:25 | #39

    Hi Rubikian,

    There is a section in PCI Express specification describing that scrambling and descrambling has to be synchronized in order to avoid the LFSR initialization problem you’ve described.
    Specifically, any COM character initializes LFSR with all-1’s.
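
The effect of LFSR synchronization can be seen in a small software model. The sketch assumes a PCI Express-style additive scrambler (Galois LFSR for x^16 + x^5 + x^4 + x^3 + 1, LSB-first data) and does not model COM characters; it only compares a descrambler started from the same seed with one whose LFSR is one byte ahead. scramble_bytes is a hypothetical helper name.

```python
M = 16
TAPS = 0x0039           # x^5 + x^4 + x^3 + 1 (x^16 implicit)
MASK = (1 << M) - 1


def scramble_bytes(data, seed):
    """Additive scrambler; applying it twice with the same seed is the identity."""
    lfsr, out = seed, bytearray()
    for byte in data:
        res = 0
        for i in range(8):
            msb = (lfsr >> (M - 1)) & 1
            res |= (((byte >> i) & 1) ^ msb) << i
            lfsr = (lfsr << 1) & MASK
            if msb:
                lfsr ^= TAPS
        out.append(res)
    return bytes(out)


if __name__ == "__main__":
    plain = bytes(4)                                  # four 0x00 bytes
    tx = scramble_bytes(plain, 0xFFFF)
    print("scrambled:        ", tx.hex())
    print("same seed:        ", scramble_bytes(tx, 0xFFFF).hex())
    print("LFSR one byte off:", scramble_bytes(tx, 0xE817).hex())
```

Only the descrambler that starts from the same LFSR state as the scrambler returns the original zeros; the out-of-step one produces unrelated bytes.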


  40. shalini
    July 15th, 2014 at 01:12 | #40


    I am trying to implement an 8-bit scrambler for the polynomial x^4+x^3+1, which is quite simple. But when I simulate the generated code on the edaplayground website I am not getting the proper output. I am getting the output as just 8'bxxxxxxxx. Please help me with this. This is very important for me.

  41. July 15th, 2014 at 09:28 | #41


    It sounds like some inputs are not properly connected to the testbench.


  42. shalini
    July 15th, 2014 at 20:46 | #42

    Hi Evgeni,

    Thank you so much for your reply.If possible please provide numerical example for the explanation of Parallel scrambler generator. Please consider my request. It will help a lot.

  43. September 11th, 2014 at 10:29 | #43

    Hi, I’m guessing that this is how I’m to contact server admin. I tried both on Firefox and IE and got the same error when I tried to generate vhdl for 1+x^39+x^58. Please advise. Thanks.

    Internal Server Error

    The server encountered an internal error or misconfiguration and was unable to complete your request.

    Please contact the server administrator, and inform them of the time the error occurred, and anything you might have done that may have caused the error.

    More information about this error may be available in the server error log.

    Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

  44. September 11th, 2014 at 10:46 | #44


    What data width are you using to generate the code ?
    It takes longer for wider data widths (like 512 or 1024 bit) to generate the code, so sometimes server times out.


  45. Mlap
    September 12th, 2014 at 06:28 | #45



    Thanks for the quick reply. I was using a data width of 640, too long as you replied. But then I reread 49.2.6 and I looked at your page 1 comments quoted here (I had only read page 2 comments to start with), and realized that my width is wrong and that there is no option to get self synchronous at this time.

    “Unfortunately, different protocols use different scrambler approaches given the same polynomial.
    Even in 802.3-2005_section4 spec WIS scrambler in section 50.3.3 has scrambler LFSR independent of the input data, whereas 66/64 bit scrambler in section 49.2.6 has scrambler LFSR dependent of the input data.

    This is an important difference and I’ll need to provide an option to generate scrambler code either way.”

  46. Thomas
    November 24th, 2014 at 03:51 | #46

    Hi Evgeni, quick question. I cant quite understand, the polynomial selection for the scrambler. I have used your CRC tool before (thanks by the way), and polynomial selection there determines the maximum Hamming Distance, a very important aspect of CRC. What would different polynomials determine for the scrambler?

  47. November 24th, 2014 at 08:19 | #47

    Hi Thomas,

    An important property for scrambler polynomial is randomness, such that the output spectrum looks like white noise in the frequency domain.
    A good example would be polynomials that generate different PRBS sequences with different spectrum characteristics.
    This article written by my colleague might be helpful: http://cdn.teledynelecroy.com/files/whitepapers/designcon2013_understanding_apparent_increasing_random_jitter_with_increasing_prbs_test_pattern_lengths.pdf


  48. Thomas
    November 25th, 2014 at 01:55 | #48

    many thanks for your time

  49. abhishek
    June 10th, 2015 at 23:23 | #49

    hi,i am not able to understand the theory behind this scrambler generation,can u help me in this regard?
    Also i wanna know that is this can be implemented on FPGA

  50. Ravikumar
    October 21st, 2015 at 21:17 | #50

    Hi Evgeni,
    I need to design a scrambler with DATAWIDTH=16(N) and LFSR polynomial width=16(M). I am able to get the proper results for N=8 and M=16; but I am not able to create equations for N=16 and M=16; Please can you help me out?

    Thanks in Advance !!!

