## Parallel Scrambler Generator

Scramblers are used in many communication protocols, such as PCI Express, SAS/SATA, USB, and Bluetooth, to randomize the transmitted data. To keep this post short and focused, I’ll not discuss the theory behind scramblers. For more information about scramblers see [1], or do some googling. The topic of this post is the parallel implementation of a scrambler generator. Protocol specifications define the scrambling algorithm using either hex or polynomial notation. This is not always suitable for efficient hardware or software implementation. Please read my post on the Parallel CRC Generator for more on that.

The Parallel Scrambler Generator method that I’m going to describe has a lot in common with the Parallel CRC Generator. The difference is that a CRC generator outputs a CRC value, whereas a scrambler generator produces scrambled data. But the internal workings of both are based on the same principle.

Here is an example of a scrambler with the polynomial G(x) = x^{16}+x^{5}+x^{4}+x^{3}+1

Following is the description of the Parallel Scrambler Generator algorithm:

(1) Let’s denote N=data width, M=generator polynomial width.

(2) Implement serial scrambler generator using given polynomial or hex notation. It’s easy to do in any programming language or script: C, Java, Perl, Verilog, etc.

(3) Parallel Scrambler implementation is a function of N-bit data input as well as M-bit current state of the polynomial, as shown in the above figure. We’re going to build three matrices:

- Mout (next state polynomial) as a function of Min (current state polynomial) when Nin=0,
- Nout as a function of Nin when Min=0, and
- Nout as a function of Min when Nin=0.

Note that the polynomial next state doesn’t depend on the scrambled data, therefore we need only three matrices.

(4) Using the routine from (2), calculate the Mout values given Min, when Nin=0. Each Min value is one-hot encoded, that is, there is only one bit set.

(5) Build an MxM matrix. Each row contains the results from (4) in increasing order. For example, the 1st row contains the result of input=0x1, the 2nd row is input=0x2, etc. The output is M bits wide, which is the polynomial width.

(6) Calculate the Nout values given Nin, when Min=0. Each Nin value is one-hot encoded, that is there is only one bit set.

(7) Build an NxN matrix. Each row contains the results from (6) in increasing order. The output is N bits wide, which is the data width.

(8) Calculate the Nout values given Min, when Nin=0. Each Min value is one-hot encoded, that is there is only one bit set.

(9) Build an MxN matrix. Each row contains the results from (8) in increasing order. The output is N bits wide, which is the data width.

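(10) Now, build an equation for each Nout[i] bit: every Nin[j] and Min[k] whose row has a bit set in column [i] of the NxN and MxN matrices participates in the equation. The participating inputs are XORed together. The Mout[i] (next state) equations are built the same way from the MxM matrix alone.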

Nout is the parallel scrambled data.
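The steps above can be sketched in Python. This is a minimal illustration, not the code behind the generator tool: the serial model (`serial_step`), its Galois-style tap positions for the polynomial above, the bit-0-first ordering, and all function names are assumptions made for the example.

```python
M = 16                 # polynomial (LFSR state) width
N = 8                  # data width
TAPS = (0, 3, 4, 5)    # assumed tap positions for x^16 + x^5 + x^4 + x^3 + 1


def serial_step(state, bit_in):
    """Step (2): push one data bit through the serial scrambler."""
    msb = (state >> (M - 1)) & 1
    bit_out = bit_in ^ msb                    # scrambled bit
    state = (state << 1) & ((1 << M) - 1)     # shift the LFSR
    if msb:                                   # feedback does not depend on the data
        for t in TAPS:
            state ^= 1 << t
    return state, bit_out


def serial_word(state, data):
    """Run N data bits (bit 0 first) through the serial model."""
    out = 0
    for i in range(N):
        state, bit = serial_step(state, (data >> i) & 1)
        out |= bit << i
    return state, out


def build_matrices():
    """Steps (4)-(9): probe the serial model with one-hot inputs.

    Each matrix is a list of rows; a row is an int whose bit i is column i.
    """
    mout_from_min = [serial_word(1 << k, 0)[0] for k in range(M)]  # MxM
    nout_from_nin = [serial_word(0, 1 << j)[1] for j in range(N)]  # NxN
    nout_from_min = [serial_word(1 << k, 0)[1] for k in range(M)]  # MxN
    return mout_from_min, nout_from_nin, nout_from_min


def equation(i, matrices):
    """Step (10): list the inputs XORed together to form Nout[i]."""
    _, nout_from_nin, nout_from_min = matrices
    terms = [f"Nin[{j}]" for j in range(N) if (nout_from_nin[j] >> i) & 1]
    terms += [f"Min[{k}]" for k in range(M) if (nout_from_min[k] >> i) & 1]
    return f"Nout[{i}] = " + " ^ ".join(terms)


def parallel_word(state, data, matrices):
    """Evaluate all equations at once: XOR the rows selected by set input bits."""
    mout_from_min, nout_from_nin, nout_from_min = matrices
    next_state, out = 0, 0
    for k in range(M):
        if (state >> k) & 1:
            next_state ^= mout_from_min[k]
            out ^= nout_from_min[k]
    for j in range(N):
        if (data >> j) & 1:
            out ^= nout_from_nin[j]
    return next_state, out


matrices = build_matrices()
print(equation(0, matrices))   # Nout[0] = Nin[0] ^ Min[15]
```

Because this serial model is linear over GF(2), `parallel_word` should agree with `serial_word` for any state and data, which gives a quick way to check a set of generated equations.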

Keep me posted if the Parallel Scrambler Generation tool works for you, or you need more clarifications on the algorithm.

**References**:

Evgeni,

Great site! Unfortunately I am another person in need of a slightly different implementation. However, I don’t believe it is related to self-synchronizing vs frame-synchronous, as I have applications for both and the implementations are different from yours. I noticed that my generator polynomials are listed in reverse order to yours, and that might indicate the difference. For example:

My first polynomial is G(x) = 1 + x^9 + x^11 (OTN PN-11). The implementation does not have the XORs in between the shift registers; it only uses them in the feedback loop. For this example the D0 reg is fed by the XOR of D8 and D10. The data between the regs is a straightforward shift. BTW, it is used in self-synchronous mode.

My second polynomial is G(x) = 1 + x + x^3 + x^12 + x^16 (OTN scrambler). That would also be a simple shift register with D0 fed by the XOR of D0, D2, D11 and D15. This one is used in frame-synchronous mode.

So I believe at least in my case the change I need in your generator is where the XORs are placed. In fact, my required implementation might actually be simpler.

Thanks
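The structure described in this comment, a plain shift register with XORs only in the feedback path, is often called a Fibonacci LFSR. A minimal Python sketch for the 1 + x^9 + x^11 example follows; the function names, the integer state encoding (bit i holds register Di), and taking the output from D10 are assumptions for illustration.

```python
WIDTH = 11
TAPS = (8, 10)          # D0 is fed by D8 XOR D10, as described in the comment


def fibonacci_step(state):
    """Shift once; bit i of `state` models register Di."""
    feedback = 0
    for t in TAPS:
        feedback ^= (state >> t) & 1
    out = (state >> (WIDTH - 1)) & 1                         # bit leaving D10
    state = ((state << 1) | feedback) & ((1 << WIDTH) - 1)   # straight shift, D0 <- feedback
    return state, out


def period(seed):
    """Count steps until the register returns to its starting value."""
    state, steps = seed, 0
    while True:
        state, _ = fibonacci_step(state)
        steps += 1
        if state == seed:
            return steps


print(period(0x7FF))    # 2047, i.e. 2**11 - 1 for a maximal-length sequence
```

The scrambler model used in the post places XORs between registers instead (a Galois arrangement), which is one reason the same polynomial can lead to different-looking equations.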

My polynomial equation is 1+ x^3+ x^4 + x^5 + x^15

The input data is 8 bits wide, and after scrambling the output data should also be 8 bits.

Can you give Verilog code for this?

Thanks for writing this article!

I am interested in modeling a PCI Express Scrambler (N=8, M=16), but I’m having trouble understanding how you compute the output in the final step. I’ve read the CRC article from circuit cellar and I understand the CRC5 example.

However, for the PCIe scrambler, I don’t quite follow. I have all three of my matrices, but I don’t understand how you are combining the outputs.

In the CRC paper, you had a 4×5 and a 5×5 matrix (column sizes are equal) and I see how these are combined to create the output equations. However, for a scrambler with N=8, M=16, you end up with 16×16, 8×8, and 16×8 matrices. The first matrix has 8 extra columns, do I just ignore these columns in the output equation?

I think I have figured it out. For the PCIe scrambler, it would seem that we do not need to compute the NxN and MxN matrices as the input has no effect on the LFSR state.

As such, we can just compute the MxM matrix to get the equations for next state of the LFSR. I looked at the Verilog output of your generate tool, and I have been able to reproduce the LFSR next state equations based on my own simulation.

Hi,

Very good post, my friend. I work a little with SATA and PCIe. I want to do parallel scrambling. My G(x)=X^16+X^15+X^13+X^4+1 and my input is 32 bits wide. I tried to learn the algorithm from a SATA tutorial and apply it to my problem.

[Block diagram, garbled in extraction: a register (Reg) feeds two matrix-multiply blocks, both drawn as *M1; one produces output (31 downto 16) and the other produces output (15 downto 0).]

where M1 and M2 are 16x16 multiplication matrices.

I could not understand how to construct M1 and M2.

@Babar

Hi Babar,

I’m looking into SATA spec, section A.2.3 (example scrambler implementation). There is a C-code example on that page that can be directly converted to parallel scrambler implemented in Verilog/VHDL.

next[31:16] equations are functions of now[15:0], and represent M1 matrix.

You can look at it this way: next[31:16] = M1*now[15:0]. Each M1 element is either 1 or 0.

next[15:0] equations are functions of now[15:0], and represent M2 matrix.

Thanks,

Evgeni
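One way to see where matrices like M1 and M2 come from: any 32 output bits of an LFSR are a linear (XOR) function of the 16 current state bits, so each matrix column can be found by feeding in a state with a single bit set. The sketch below uses a hypothetical serial model (`keystream32`) with assumed tap positions for X^16+X^15+X^13+X^4+1. It does not reproduce the bit ordering of the SATA specification's C code, so the resulting matrices are illustrative only.

```python
TAPS = (0, 4, 13, 15)   # assumed Galois taps for x^16 + x^15 + x^13 + x^4 + 1


def keystream32(now):
    """Hypothetical serial model: 32 output bits from a 16-bit state."""
    state, out = now, 0
    for i in range(32):
        msb = (state >> 15) & 1
        out |= msb << i
        state = (state << 1) & 0xFFFF
        if msb:
            for t in TAPS:
                state ^= 1 << t
    return out


# Column k of each matrix is the response to the one-hot state with bit k set.
responses = [keystream32(1 << k) for k in range(16)]
M1_cols = [r >> 16 for r in responses]       # drives next[31:16]
M2_cols = [r & 0xFFFF for r in responses]    # drives next[15:0]


def multiply(cols, now):
    """GF(2) matrix-vector product: XOR the columns selected by set bits of `now`."""
    result = 0
    for k, col in enumerate(cols):
        if (now >> k) & 1:
            result ^= col
    return result


now = 0xF0F6
next_word = (multiply(M1_cols, now) << 16) | multiply(M2_cols, now)
assert next_word == keystream32(now)   # matrices agree with the serial model
```

The same probing works with any serial reference, including a direct port of a specification's C code, as long as that reference is a pure XOR function of the state.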

Hi Evgeni,

Thank you very much for reply.

Actually I have the same document you have. My problem is M1 and M2. I am trying to understand how he constructed M1 and M2.

If you have any idea how he constructed M1 and M2 please do reply.

Thanks in advance.

Babar

I have generated VHDL code for a PCIe scrambler. It generates the correct result for the 8-bit data implementation, but it does not work for the 16- and 32-bit data cases (the results are different).

Hi,

It suggests there is something wrong with the byte order. Try the scrambler with 16-bit data. Then feed the same data into an 8-bit scrambler. You should see the same result every other clock in the 8-bit one.

Thanks,

Evgeni
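The check suggested in this reply can be sketched as follows. The serial model, its tap positions, the all-ones seed, and the bit-0-first ordering are assumptions for the example and may differ from the generated VHDL.

```python
M = 16
TAPS = (0, 3, 4, 5)     # assumed taps for x^16 + x^5 + x^4 + x^3 + 1


def scramble(state, data, width):
    """Scramble `width` data bits, bit 0 first; returns (next_state, result)."""
    out = 0
    for i in range(width):
        msb = (state >> (M - 1)) & 1
        out |= (((data >> i) & 1) ^ msb) << i
        state = (state << 1) & 0xFFFF
        if msb:
            for t in TAPS:
                state ^= 1 << t
    return state, out


word = 0xBEEF

# One 16-bit step.
state16, out16 = scramble(0xFFFF, word, 16)

# Two 8-bit steps, low byte first.
state8, low = scramble(0xFFFF, word & 0xFF, 8)
state8, high = scramble(state8, word >> 8, 8)

assert out16 == (high << 8) | low     # same scrambled data
assert state16 == state8              # same LFSR state every other 8-bit clock

# Feeding the high byte first is a typical byte-order mistake.
s, a = scramble(0xFFFF, word >> 8, 8)
s, b = scramble(s, word & 0xFF, 8)
print(out16 == ((a << 8) | b))        # False for this data
```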

Hi Everyone,

I am a little bit confused about the CRC and scrambling sequence used in SATA.

My first confusion is that

For SATA we have data coming like this: SOF-FIS-CRC-EOF. The size of the CRC is bounded to 4 bytes. If I calculate the CRC of every DWORD coming from a FIS of size, e.g., 1024 DWORDs, where will the CRC of this much data reside (what does the CRC of this many DWORDs look like)?

My second confusion is

For the transmitter case I will calculate the CRC of the incoming DWORD, and the same DWORD will go to the scrambler. What about the CRC DWORD? Do we need to scramble it too?

Please if someone can clear these confusions I will be very thankful to you.

Thanks in advance

/Babar

Hi Babar,

CRC doesn’t grow in size with the new data. If you have 32-bit CRC, it remains the same size for every new data chunk. You just use an old CRC and data to calculate the new CRC.

As far as scrambling CRC DWORD, the SATA specification should answer it (it has to be somewhere in the spec).

Thanks,

Evgeni
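A small Python sketch of the point about CRC size. It uses `zlib.crc32`, which is a common CRC-32 variant and is not claimed to match the SATA CRC parameters; the DWORD values are arbitrary example data.

```python
import zlib

dwords = [0x00308027, 0xE1234567, 0x00000000, 0x00000002]  # arbitrary example data

crc = 0
for dword in dwords:
    # Old CRC + new data -> new CRC; the result is always 32 bits wide.
    crc = zlib.crc32(dword.to_bytes(4, "little"), crc)
    assert 0 <= crc <= 0xFFFFFFFF

# Processing the DWORDs one at a time gives the same CRC as processing them at once.
whole = zlib.crc32(b"".join(d.to_bytes(4, "little") for d in dwords))
print(hex(crc), crc == whole)
```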

@Evgeni

Thanks a lot for your very informative and prompt reply.

Regards

/Babar

Hi Evgeni;

I am generating VHDL code for parallel scrambler from this website for SATA. But I have a confusion that the C-code given on SerialATA_Revision_3_1 document is using the input for scrambler but for scrambling. Do you have any idea about that.

Please just share your thoughts.

Thanks in advance.

Regards

/Babar

Hi Babar,

I didn’t understand your question. Can you please be more specific ?

Thanks,

Evgeni

@Evgeni

Hi,

What I know about scrambling is that we need to XOR the scrambler output with the input data. This I learned from the SATA tutorial. In the SATA tutorial, C code for the scrambler is given in the appendix (as you know). This C code does not use any input for the scrambler; it just takes the initial value 0xF0F6 and generates a continuous output, whose first value must be 0xC2D2768D. I verified this code. So after the scrambler we just XOR the data (that we want to scramble) with the scrambler output.

Is the VHDL code that I generated from outputlogic only the scrambler, or does it provide the scrambled data, which I do not need to XOR with the input? If it is producing scrambled data, then the first value is wrong.

I hope you understand

Waiting for your feedback.

Thanks

BR

Babar

Hi Babar,

Right, scrambling used in SATA is a bit different than the code generated on this site. In general, there are several ways to do scrambling, and each protocol (e.g. SATA, PCI Express, USB, etc.) does it in a bit different way for various reasons. This has been discussed to some extent in the previous comments on this forum.

Thanks,

Evgeni

@Evgeni

Hi,

Yes, you are right. Now I used the C code for the VHDL implementation and it works perfectly.

But I really appreciate your efforts to help people and I will definitely pay a salute to you for such a great work.

Regards

Babar

Stockholm (Sweden)

16-bit parallel adder using registers

Hi,

I found the scrambler tool created code did not match Interlaken spec. Any idea?

thanks,

Lawrence

Hi Lawrence,

I don’t have access to Interlaken spec. Can you please email me the pages dealing with Interlaken scrambling. I’ll take a look.

Thanks,

Evgeni

Hi, I am doing a scrambler project; it scrambles serial input data. In total I have 128 bits of input data, and the polynomial is x^7+x^4+1. Can you please help me write a testbench in SystemVerilog?

Hello; I am doing a project on implementing a scrambler and descrambler in VHDL, and I am missing the descrambler program. Please reply, or point me to a specific website.

Hi,

In most cases, descrambler has the same implementation as scrambler.

Thanks,

Evgeni

Hi,

I’m trying to use your webtool to generate a 1+x^43 scrambler. But I’m not sure if your webtool outputs correct Verilog code, e.g. for data width=64,

…

data_c[0] = data_in[0] ^ lfsr_q[42];

data_c[1] = data_in[1] ^ lfsr_q[41];

…

data_c[41] = data_in[41] ^ lfsr_q[1];

data_c[42] = data_in[42] ^ lfsr_q[0];

data_c[43] = data_in[43] ^ lfsr_q[42]; **** shouldn’t this be “data_in[43] ^ data_c[0]” instead of “lfsr_q[42]”?

data_c[44] = data_in[44] ^ lfsr_q[41]; **** shouldn’t this be “data_in[44] ^ data_c[1]” instead of “lfsr_q[41]”?

...

After bit 42, the data_in[] should be XOR with data_c[], right? I’m confused. Please help.

Thanks,

Silas

Hi Silas,

Do you have another reference implementation that has a different code ?

Also, one quick way to verify your concerns is to generate a serial scrambler and compare with the 64-bit one. Both should output the same scrambled data, of course.

Thanks,

Evgeni

Can we have a simple worked example of an 8-bit data scrambler and descrambler? I am trying to find an example, but Google returned thousands of types, which confused me more. I would highly appreciate it if anyone could give a simple example of scrambling and descrambling 8-bit data.

Hi Evgeni, your website is very good. I am trying to incorporate your method into my scrambler design for IEEE802.3 Ethernet PCS. I wonder what’s the maximum clock frequency the circuit can support, since I am working on a 40G Ethernet project.

Hi Kevin,

Max frequency very much depends on the data width. I cannot even give a ballpark number; you’d need to generate the code and run it through synthesis and place-and-route tools to get an idea of the performance.

Thanks,

Evgeni

Thank you for your reply. I am using a 64-bit wide input, but may expand to 256-bit in the future. The clock is now my main concern. I also have another question: the other designs I found in Xilinx IP cores (typically Xilinx xapp775) use another, more direct way to implement the scrambler, that is, N scrambler registers for an order-N polynomial. I wonder whether the actual function of your method will differ from that of those validated IP cores. It seems the method you proposed optimizes area cost, which is what I want to make use of, because we have more important uses for the resources on our Altera Stratix IV. Thanks. @Evgeni

Hi Kevin,

Even for 256-bit input data the scrambler should have no problem running at 200MHz on Stratix-IV. I assume that your 40G datapath is running at about 125-150MHz.

My method is to generate simple XOR trees. I found it yields the best results in terms of FPGA area.

Thanks,

Evgeni

Thank you for your reply, I will compare and verify these 2 methods. I also want to explore a full cycle of ASIC design using a MOSIS project; your website gives me lots of help if I want to design some simple course practice IP cores. Thanks!!! @Evgeni

Hi Evgeni, I tried both your code and the code from Xilinx and Altera. It looks like yours is not producing the same results; I understand the Altera and Xilinx code well, since it follows the IEEE802.3 standard for the PCS scrambler. Can you explain why? The references I have are Xilinx xapp775 and the Altera Advanced Synthesis Cookbook. @Evgeni

Hi Kevin,

The most common reason for a mismatch with the expected results is bit and byte ordering. Please take a look at the previous comments in this thread.

Thanks,

Evgeni

I looked over all the previous comments; they mention frame-sync and self-sync scramblers. I have no idea about the mathematical theory behind that, but all the reference designs I found, such as Xilinx and Altera, follow the self-sync scrambler. I suppose the scramblers in IEEE802.3 and other standards for PCIe and Interlaken are all the same. So can you point out which industrial standards your method is compatible with? @Evgeni

I used it in cores for PCIe, USB, SAS/SATA, and several wireless protocols. I haven’t personally used it in IEEE802.3 and Interlaken.

To be safe, I intend to use the code from Altera and Xilinx, but after this project I may be able to read those papers and figure out something to help you improve. Thanks. @Evgeni

Hi Evgeni,

I have a question. When I try to come up with the descrambler code, I think I might need an initial seed for my polynomial. For example, you are setting {16{1'b1}} as your polynomial’s initial value.

But the question is that we cannot predict what the input data from upstream will be. So how should I set the initial seed for my polynomial?

I did write PRBS checker code before. I think that is easier, because I just need to buffer my previous incoming parallel data and XOR it with the algorithm. The result of this XOR operation will match the next incoming parallel data. But it seems the scrambler code is a bit different from the PRBS checker…

Would you mind to provide some advice on this?

Thanks,

Rubikian

I looked into the PCI Express Base Specification and found that scrambling and descrambling use exactly the same algorithm. But when I try to mimic this in my logic and run it in simulation, I just can’t get back the original data from the descrambler output…

Really frustrated with the logic behind it.

Fixed data 8'h00 goes into the logic … scrambled output = FF, 17, C0, 14 and so on

But when this set of data goes into the descramble logic, why can’t I get back 8'h00 on the output side?

Hi Rubikian,

There is a section in PCI Express specification describing that scrambling and descrambling has to be synchronized in order to avoid the LFSR initialization problem you’ve described.

Specifically, any COM character initializes LFSR with all-1’s.

Thanks,

Evgeni
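The synchronization issue discussed here can be illustrated with a short Python sketch. The serial model, tap positions, and bit-0-first ordering are assumptions for the example; with an all-ones seed and all-zero input this model happens to reproduce the FF, 17, C0, 14 sequence quoted above.

```python
TAPS = (0, 3, 4, 5)     # assumed taps for x^16 + x^5 + x^4 + x^3 + 1
SEED = 0xFFFF           # all-ones LFSR initialization


def scramble_bytes(data, state=SEED):
    """XOR each byte with LFSR output; the LFSR never sees the data."""
    out = []
    for byte in data:
        result = 0
        for i in range(8):
            msb = (state >> 15) & 1
            result |= (((byte >> i) & 1) ^ msb) << i
            state = (state << 1) & 0xFFFF
            if msb:
                for t in TAPS:
                    state ^= 1 << t
        out.append(result)
    return out


scrambled = scramble_bytes([0x00] * 4)
print([hex(b) for b in scrambled])          # ['0xff', '0x17', '0xc0', '0x14']

# Descrambling is the same operation, started from the same seed.
print(scramble_bytes(scrambled))            # [0, 0, 0, 0]

# A descrambler whose LFSR is one byte out of step does not recover the data.
print(scramble_bytes(scrambled, state=0xE817) == [0, 0, 0, 0])   # False
```

Resetting both LFSRs on a known event, as described in the reply, is what keeps the two sides in step.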

Hi,

I am trying to implement an 8-bit scrambler for the polynomial x^4+x^3+1, which is quite simple. But when I simulate the generated code on the edaplayground website I am not getting the proper output. I’m getting the output as just 8'bxxxxxxxx. Please help me with this; it is very important for me.

Hi,

It sounds like some inputs are not properly connected to the testbench.

Thanks,

Evgeni

Hi Evgeni,

Thank you so much for your reply.If possible please provide numerical example for the explanation of Parallel scrambler generator. Please consider my request. It will help a lot.

Hi, I’m guessing that this is how I’m to contact server admin. I tried both on Firefox and IE and got the same error when I tried to generate vhdl for 1+x^39+x^58. Please advise. Thanks.

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

Hi,

What data width are you using to generate the code ?

It takes longer for wider data widths (like 512 or 1024 bit) to generate the code, so sometimes server times out.

Thanks,

Evgeni

@Evgeni

Hi,

Thanks for the quick reply. I was using a data width of 640, too long as you replied. But then I reread 49.2.6 and I looked at your page 1 comments quoted here (I had only read page 2 comments to start with), and realized that my width is wrong and that there is no option to get self synchronous at this time.

“Unfortunately, different protocols use different scrambler approaches given the same polynomial.

Even in 802.3-2005_section4 spec WIS scrambler in section 50.3.3 has scrambler LFSR independent of the input data, whereas 66/64 bit scrambler in section 49.2.6 has scrambler LFSR dependent of the input data.

This is an important difference and I’ll need to provide an option to generate scrambler code either way.”
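To illustrate the distinction in the quoted text, here is a minimal bit-serial Python sketch of a self-synchronous scrambler for 1+x^39+x^58, where the shift register is fed by the scrambled output rather than running independently of the data. The function names, the bit ordering, and the history encoding are assumptions for illustration and are not taken from the 802.3 text or from the generator tool.

```python
import random

TAPS = (39, 58)                 # delays for 1 + x^39 + x^58
WIDTH = max(TAPS)
MASK = (1 << WIDTH) - 1


def _feedback(history):
    """XOR of the scrambled bits 39 and 58 positions back (bit 0 = most recent)."""
    fb = 0
    for t in TAPS:
        fb ^= (history >> (t - 1)) & 1
    return fb


def scramble(bits, history=0):
    out = []
    for d in bits:
        s = d ^ _feedback(history)
        out.append(s)
        history = ((history << 1) | s) & MASK   # register is fed by the scrambled output
    return out


def descramble(bits, history=0):
    out = []
    for s in bits:
        out.append(s ^ _feedback(history))
        history = ((history << 1) | s) & MASK   # register is fed by the received bits
    return out


rng = random.Random(0)
data = [rng.getrandbits(1) for _ in range(256)]
tx = scramble(data, history=rng.getrandbits(WIDTH))   # transmitter state unknown to receiver
rx = descramble(tx, history=0)
assert rx[WIDTH:] == data[WIDTH:]    # in sync once 58 received bits have been shifted in
```

In this form each scrambled bit depends on earlier scrambled bits, so no seed has to be shared between the two ends.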

Hi Evgeni, quick question. I can’t quite understand the polynomial selection for the scrambler. I have used your CRC tool before (thanks, by the way), and polynomial selection there determines the maximum Hamming distance, a very important aspect of a CRC. What would different polynomials determine for the scrambler?

Hi Thomas,

An important property for scrambler polynomial is randomness, such that the output spectrum looks like white noise in the frequency domain.

A good example would be polynomials that generate different PRBS sequences with different spectrum characteristics.

This article written by my colleague might be helpful: http://cdn.teledynelecroy.com/files/whitepapers/designcon2013_understanding_apparent_increasing_random_jitter_with_increasing_prbs_test_pattern_lengths.pdf

Thanks,

Evgeni

many thanks for your time

Hi, I am not able to understand the theory behind this scrambler generation. Can you help me in this regard?

Also, I want to know whether this can be implemented on an FPGA.

Hi Evgeni,

I need to design a scrambler with DATAWIDTH=16(N) and LFSR polynomial width=16(M). I am able to get the proper results for N=8 and M=16; but I am not able to create equations for N=16 and M=16; Please can you help me out?

Thanks in Advance !!!

Regards

Ravikumar