Author Archive

Book: 100 Power Tips for FPGA Designers

May 23rd, 2011 76 comments




This book is a collection of articles on various aspects of FPGA design: synthesis, simulation, porting ASIC designs, floorplanning and timing closure, design methodologies, performance, area and power optimizations, RTL coding, IP core selection, and many others.

The book is intended for system architects, design engineers, and students who want to improve their FPGA design skills. Both novice and seasoned logic and hardware engineers can find bits of useful information.

This book is written by a practicing FPGA logic designer and contains many illustrations, code examples, and scripts. Rather than providing information applicable to all FPGA vendors, this edition focuses on the Xilinx Virtex-6 and Spartan-6 FPGA families. Code examples are written in Verilog HDL.


Download excerpt from the book
Download source code, projects, and scripts


Paperback edition on Amazon.com, Amazon.de, and Amazon.co.uk

Number of pages: 474
Publisher: CreateSpace
ISBN: 978-1461186298


Kindle edition on Amazon.com

The book can be read in color on a PC or Mac using the free Kindle for PC or Kindle for Mac application.
It can also be read on an iPhone or iPad using the free Kindle for iPhone or Kindle for iPad application.


Flipkart.com
Readers based in India can purchase the book on Flipkart.com


www.phei.com.cn
Chinese-speaking readers can purchase the book on PHEI


Google eBook edition
The book can be read in color on a PC, Mac, or tablet such as an iPad. An extensive preview is available.


ePub edition on Barnes and Noble
The book can also be read using the free Nook for PC or Adobe Digital Editions applications, or on other eReaders that support the ePub format.




Any questions, comments, or suggestions about the book are welcome.


Using Xilinx tools in command-line mode

April 30th, 2011 12 comments


Many FPGA designers don’t take full advantage of the command-line options that FPGA synthesis and physical implementation tools offer. I’ve published an article on the subject in the Xilinx Xcell Journal. You can read it here. All the source code, projects, and scripts are available here.


Download the article
Download the source code, projects, and scripts





ReportXplorer

November 12th, 2010 3 comments


ReportXplorer is a web application that allows users to view and analyze Xilinx® reports.
ReportXplorer user guide


There are several advantages of using ReportXplorer:

No installation: because it’s a web application, it doesn’t require installation. ReportXplorer uses the Adobe Flash plug-in, which is already installed in most web browsers. The application can be used on any computer and in any operating system environment, including Mac OS and mobile devices.

Ease-of-use: It only takes two steps and a few seconds to open multiple Xilinx reports: enter the application URL in a browser, and navigate to reports within a “Load Reports” dialog.

Time-saver: Engineers spend a lot of valuable time opening and analyzing reports scattered across different directories on different machines. ReportXplorer helps reduce that time, and makes the process more organized and productive.

Analytics: ReportXplorer parses reports and provides instant visual analytics that enable rapid comprehension of critical information and can help find design problems. That is much faster than analyzing text-based reports, using existing tools, or running custom search scripts. Users can open multiple instances of the application in different browser tabs. That allows side-by-side comparison of report sections, analysis of trends between builds, and identification of potential problems such as new warnings or high logic utilization.

Security: the application has been developed with security as the most important requirement. ReportXplorer is inherently secure because it’s entirely client-based. No confidential design information contained in the reports is sent to the server. All the report processing is done locally on a client inside a web browser sandbox.

Fast response time: ReportXplorer is designed to allow customization and easy addition of new features. It’s a small application supported by a team of practicing logic designers and software engineers. Hence, the response time to add a new feature or fix a bug is fast. You don’t need to wait several months for the “next release”.


Use Cases: ReportXplorer can be used in the following cases and situations:

  • To provide more report viewing and analysis capabilities compared with existing tools
  • To enable report viewing in a system that doesn’t have native tools installed, such as on a mobile device
  • To enable report viewing of a build that doesn’t have an associated Xilinx ISE project, for example builds from script
  • Side-by-side comparison of multiple reports opened in the same or different applications



ReportXplorer is written using Adobe Flex and ActionScript. Adobe Flex is a layer on top of Adobe Flash that allows easy development of rich internet applications (RIA). Selecting the right technology for this application was an interesting process. We evaluated several options, including Microsoft Silverlight, several JavaScript libraries, and HTML5. Microsoft Silverlight is not well supported by operating systems other than Windows. Although HTML5 has all the features needed for full client-based report processing, it’s still an emerging technology with limited browser support. JavaScript doesn’t allow opening a file and processing it in a browser without sending it to the server first. This is a security measure. Adobe Flex was the best fit for meeting all the requirements.


Another decision was not to make the application open source. This is a relatively small application, and the management overhead associated with ensuring good quality isn’t worth it. Participating in development of the application requires good Adobe Flex programming skills, which are not as common as C/C++/Java. Also, Adobe Flex development tools are not free. At this moment [Nov 2010] the application is released as beta and is free of charge. We reserve the right to charge a fee later on to cover development costs.



That was a brief introduction. I’d like to encourage visitors to become active users of this tool, and to post comments with new feature requests, bug reports, or just general feedback.




LFSR Counters – Part 3

May 11th, 2009 4 comments


  Here is how the LFSR Counter Generator works:

(1) Specify the counter value, e.g. 200. That value needs 8 bits, so the tool selects an 8-bit LFSR with polynomial coefficients taken from the table in [1].

(2) Reset the LFSR to 0 and run a loop that shifts the LFSR 200 times. Then latch its value (LFSR_COUNT_VAL).

(3) Use that 8-bit LFSR and LFSR_COUNT_VAL to generate Verilog code. When the LFSR hits LFSR_COUNT_VAL, it has counted 200.

This approach works because the polynomial selected in (1) has the maximum-length property. That is, an n-bit LFSR cycles through a sequence of 2^n-1 unique values before it repeats.
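The three steps above can be sketched in Python. This is a minimal sketch, not the generator's actual code: the function names are hypothetical, and it assumes a left-shifting LFSR with XNOR feedback (so that the all-zero reset state is a valid state) and the 8-bit tap positions 8, 6, 5, 4.

```python
def lfsr_step(state, width, taps):
    """Advance the LFSR by one shift using XNOR feedback (assumed convention)."""
    fb = 1  # starting from 1 inverts the XOR of the taps, giving XNOR
    for t in taps:  # taps are 1-based bit positions
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & ((1 << width) - 1)


def lfsr_count_val(count, width, taps):
    """Step (2): reset to 0, shift `count` times, and latch the value."""
    state = 0
    for _ in range(count):
        state = lfsr_step(state, width, taps)
    return state


def lfsr_period(width, taps):
    """Number of shifts before the LFSR returns to its reset state."""
    state, shifts = lfsr_step(0, width, taps), 1
    while state != 0:
        state = lfsr_step(state, width, taps)
        shifts += 1
    return shifts


TAPS_8 = (8, 6, 5, 4)  # assumed 8-bit tap positions
LFSR_COUNT_VAL = lfsr_count_val(200, 8, TAPS_8)
print(f"compare value for a count of 200: 0x{LFSR_COUNT_VAL:02x}")
print("period:", lfsr_period(8, TAPS_8))
```

Because every state within one period is unique, each count below the period maps to exactly one compare value, which is what makes the single-comparison counter possible.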

I synthesized a 32-bit LFSR counter for a Xilinx Virtex-5 chip and compared its size with a regular 32-bit counter.

Here are the results:

Module           Slices  Regs  LUTs
regular_counter  17      32    44
lfsr_counter     10      32    7


References

  1. Peter Alfke, Efficient Shift Registers, LFSR Counters, and Long Pseudo-Random Sequence Generators, Xilinx application note Xapp052




LFSR Counters – Part 2

May 11th, 2009 No comments


Because generating the code for an LFSR counter is a computation- and memory-intensive operation that runs on a server, the server usually times out once the counter size exceeds ~22 bits. I rearchitected the tool so that if the requested counter size is greater than 20 bits, the request is sent to the server in chunks of 20 bits. To implement that I used a standard AJAX approach: XMLHttpRequest and a callback. That also allowed me to add a progress bar, for which I used jsProgressBarHandler from Bram.us.

Still, it’s quite a slow operation, so I limited the LFSR Counter size to 31 bits for practical purposes. There is no fundamental problem with going beyond that. It can be as large as 168 bits; it’d just take forever to complete.
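The chunking idea can be illustrated without a server. The sketch below is a hypothetical stand-in for the AJAX loop, not the tool's code: it assumes that a "chunk" is a fixed number of LFSR shifts (2^20 by default), carries the LFSR state from one chunk to the next, and reports progress after each chunk, which is the point where a progress bar would be updated. The XNOR feedback and the 8-bit taps used in the demo are assumptions as well.

```python
def lfsr_step(state, width, taps):
    """One shift of an XNOR-feedback LFSR (assumed convention, 1-based taps)."""
    fb = 1
    for t in taps:
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & ((1 << width) - 1)


def advance_in_chunks(count, width, taps, chunk=1 << 20, on_progress=None):
    """Shift the LFSR `count` times, at most `chunk` shifts per request."""
    state, done = 0, 0
    while done < count:
        steps = min(chunk, count - done)
        for _ in range(steps):  # one "request" worth of work
            state = lfsr_step(state, width, taps)
        done += steps
        if on_progress is not None:
            on_progress(done / count)  # fraction completed so far
    return state


progress = []
value = advance_in_chunks(200, 8, (8, 6, 5, 4), chunk=64,
                          on_progress=progress.append)
print(f"latched value: 0x{value:02x}, progress updates: {len(progress)}")
```

Chunking does not reduce the total work, which still grows as 2^n; it only keeps each individual request short enough to avoid the timeout.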

I also created a stand-alone application that can generate large LFSR counters faster. Download it from SourceForge.




Parallel Scrambler Generator

May 5th, 2009 128 comments


Scramblers are used in many communication protocols, such as PCI Express, SAS/SATA, USB, and Bluetooth, to randomize the transmitted data. To keep this post short and focused I’ll not discuss the theory behind scramblers. For more information about scramblers see [1], or do some googling. The topic of this post is the parallel implementation of a scrambler generator. Protocol specifications define the scrambling algorithm using either hex or polynomial notation. This is not always suitable for efficient hardware or software implementation. Please read my post on the Parallel CRC Generator about that.

The Parallel Scrambler Generator method that I’m going to describe has a lot in common with the Parallel CRC Generator. The difference is that the CRC generator outputs a CRC value, whereas the Scrambler generator produces scrambled data. But the internal workings of both are based on the same principle.

Here is an example of a scrambler with the polynomial G(x) = x^16 + x^5 + x^4 + x^3 + 1

[Figure: scrambler for the polynomial G(x) above]

 

Following is the description of the Parallel Scrambler Generator algorithm:

(1) Let’s denote N=data width, M=generator polynomial width.

[Figure: parallel scrambler with N-bit data input and M-bit polynomial state]

(2) Implement serial scrambler generator using given polynomial or hex notation. It’s easy to do in any programming language or script: C, Java, Perl, Verilog, etc. 

(3) Parallel Scrambler implementation is a function of N-bit data input as well as M-bit current state of the polynomial, as shown in the above figure. We’re going to build three matrices:

  • Mout (next state polynomial) as a function of Min (current state polynomial) when Nin=0
  • Nout as a function of Nin when Min=0
  • Nout as a function of Min when Nin=0

      Note that the polynomial next state doesn’t depend on the data, therefore we need only three matrices rather than four.

 

(4) Using the routine from (2), calculate the Mout values given Min, when Nin=0. Each Min value is one-hot encoded, that is, there is only one bit set.

(5) Build an MxM matrix. Each row contains the results from (4) in increasing order. For example, the 1st row contains the result of input=0x1, the 2nd row is input=0x2, etc. The output is M bits wide, which is the polynomial width.

(6) Calculate the Nout values given Nin, when Min=0. Each Nin value is one-hot encoded, that is, there is only one bit set.

(7) Build an NxN matrix. Each row contains the results from (6) in increasing order. The output is N bits wide, which is the data width.

(8) Calculate the Nout values given Min, when Nin=0. Each Min value is one-hot encoded, that is, there is only one bit set.

(9) Build an MxN matrix. Each row contains the results from (8) in increasing order. The output is N bits wide, which is the data width.

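(10) Now, build an equation for each Nout[i] bit: all Nin[j] and Min[k] bits that are set in column [i] of the NxN and MxN matrices participate in the equation. The participating inputs are XORed together. The equations for the next state Mout[i] are built the same way from the MxM matrix.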

 

Nout is the parallel scrambled data.
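Steps (1) to (10) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the online tool: the serial model (a left-shifting LFSR whose top bit is XORed with the data, data processed LSB first) and all function names are hypothetical, and a real protocol may define a different bit order or tap arrangement. Each matrix row is stored as an integer bit mask.

```python
import random

M = 16
POLY = 0x0039  # x^16 + x^5 + x^4 + x^3 + 1, with the x^16 term implicit


def serial_scramble(state, data, n):
    """Step (2): scramble n data bits serially; returns (next_state, scrambled)."""
    out = 0
    for i in range(n):
        msb = (state >> (M - 1)) & 1
        out |= (((data >> i) & 1) ^ msb) << i
        state = (state << 1) & ((1 << M) - 1)
        if msb:
            state ^= POLY
    return state, out


def build_matrices(n):
    """Steps (4) to (9): one row per one-hot input value."""
    m_to_m = [serial_scramble(1 << k, 0, n)[0] for k in range(M)]  # MxM
    n_to_n = [serial_scramble(0, 1 << j, n)[1] for j in range(n)]  # NxN
    m_to_n = [serial_scramble(1 << k, 0, n)[1] for k in range(M)]  # MxN
    return m_to_m, n_to_n, m_to_n


def parallel_scramble(state, data, n, matrices):
    """Step (10): XOR together the rows selected by the set input bits."""
    m_to_m, n_to_n, m_to_n = matrices
    next_state = out = 0
    for k in range(M):
        if (state >> k) & 1:
            next_state ^= m_to_m[k]
            out ^= m_to_n[k]
    for j in range(n):
        if (data >> j) & 1:
            out ^= n_to_n[j]
    return next_state, out


N = 8
MATRICES = build_matrices(N)
state, data = random.getrandbits(M), random.getrandbits(N)
print(parallel_scramble(state, data, N, MATRICES) == serial_scramble(state, data, N))
```

XORing whole rows is equivalent to evaluating the per-column equations for all output bits at once; a code generator would instead walk each column and emit one XOR expression per Nout[i] and Mout[i].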

 

Keep me posted if the Parallel Scrambler Generation tool works for you, or you need more clarifications on the algorithm.


References:

  1. Scramblers on Wiki




Parallel CRC Generator

May 5th, 2009 218 comments


Download a full version of this article


Every modern communication protocol uses one or more error detection algorithms. Cyclic Redundancy Check, or CRC, is by far the most popular one. CRC properties are defined by the generator polynomial length and coefficients. The protocol specification usually defines CRC in hex or polynomial notation. For example, CRC5 used in the USB 2.0 protocol is represented as 0x5 in hex notation or as G(x) = x^5 + x^2 + 1 in polynomial notation. This CRC is implemented in hardware as a shift register as shown in the following picture.

[Figure: USB CRC5 shift register]

The problem is that in many cases the shift register implementation is suboptimal. It only allows the calculation of one bit every clock. If a design has a 32-bit wide datapath, meaning that every clock the CRC module has to calculate CRC on 32 bits of data, this scheme will not work. Somehow this serial shift register implementation has to be converted into a parallel N-bit wide circuit, where N is the design datapath width, so that every clock N bits are processed.

I started researching the available literature on parallel CRC calculation methods and found only a handful of papers ([2], [3]) that deal with this issue. Most sources are academic and focus on the theoretical aspect of the problem. They are too impractical to implement in software or hardware for a quick code generation.

I came up with the following scheme that I’ve used to build an online Parallel CRC Generator tool. Here is a description of the steps, in which I make use of the USB CRC5 mentioned above.

[Figure: parallel CRC5 with N-bit data input and M-bit CRC state]

(1) Let’s denote N=data width, M=CRC width. For example, if we want to generate parallel USB CRC5 for 4-bit datapath, N=4, M=5.

(2) Implement serial CRC generator routine using given polynomial or hex notation. It’s easy to do in any programming language or script: C, Java, Perl, Verilog, etc.

(3) Parallel CRC implementation is a function of N-bit data input as well as M-bit current state CRC, as shown in the above figure. We’re going to build two matrices: Mout (next state CRC) as a function of Min (current state CRC) when Nin=0, and Mout as a function of Nin when Min=0.

(4) Using the routine from (2) calculate CRC for the N values when Min=0. Each value is one-hot encoded, that is there is only one bit set. For N=4 the values are 0x1, 0x2, 0x4, 0x8.  Mout = F(Nin,Min=0)

(5) Build an NxM matrix. Each row contains the results from (4) in increasing order. For example, the 1st row contains the result of input=0x1, the 2nd row is input=0x2, etc. The output is M bits wide, which is the desired CRC width. Here is the matrix for USB CRC5 with N=4.

[Figure: NxM matrix for USB CRC5 with N=4]

(6) Each column in this matrix, and that’s the interesting part, represents an output bit Mout[i] as a function of Nin.

(7) Using the routine from (2), calculate CRC for the M values when Nin=0. Each value is one-hot encoded, that is, there is only one bit set. For M=5 the values are 0x1, 0x2, 0x4, 0x8, 0x10.  Mout = F(Nin=0,Min)

(8) Build an MxM matrix. Each row contains the results from (7) in increasing order. Here is the matrix for USB CRC5 with N=4.

[Figure: MxM matrix for USB CRC5 with N=4]

(9) Now, build an equation for each Mout[i] bit: all Nin[j] and Min[k] bits that are set in column [i] of the two matrices participate in the equation. The participating inputs are XORed together.

Mout[0] = Min[1]^Min[4]^Nin[0]^Nin[3]

Mout[1] = Min[2]^Nin[1]

Mout[2] = Min[1]^Min[3]^Min[4]^Nin[0]^Nin[2]^Nin[3]

Mout[3] = Min[2]^Min[4]^Nin[1]^Nin[3]

Mout[4] = Min[0]^Min[3]^Nin[2]

That is our parallel CRC.
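The steps above can be sketched in Python for the USB CRC5 example with N=4. This is a minimal sketch rather than the code of the online tool: the function names are hypothetical, and it assumes the serial routine shifts left and consumes Nin[3] first (MSB first), which is the order that reproduces the equations listed above.

```python
def crc_serial(poly, m, state, data, n):
    """Step (2): shift n data bits (MSB first) through an m-bit CRC register."""
    for i in reversed(range(n)):
        fb = ((state >> (m - 1)) & 1) ^ ((data >> i) & 1)
        state = (state << 1) & ((1 << m) - 1)
        if fb:
            state ^= poly
    return state


def parallel_equations(poly, m, n):
    """Steps (4) to (9): build both matrices and read the XOR terms per column."""
    nin_rows = [crc_serial(poly, m, 0, 1 << j, n) for j in range(n)]  # NxM
    min_rows = [crc_serial(poly, m, 1 << k, 0, n) for k in range(m)]  # MxM
    equations = []
    for i in range(m):  # column i describes output bit Mout[i]
        terms = [f"Min[{k}]" for k in range(m) if (min_rows[k] >> i) & 1]
        terms += [f"Nin[{j}]" for j in range(n) if (nin_rows[j] >> i) & 1]
        equations.append(f"Mout[{i}] = " + "^".join(terms))
    return equations


USB_CRC5 = 0x05  # G(x) = x^5 + x^2 + 1, with the x^5 term implicit
for line in parallel_equations(USB_CRC5, 5, 4):
    print(line)
```

The same two functions work for other polynomials and data widths by changing `poly`, `m`, and `n`; only the serial routine needs to change if a protocol uses a different bit order.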

I presume that since the invention of the CRC algorithm more than 40 years ago, somebody has already come up with this approach. I just couldn’t find it and “reinvented the wheel”.

Keep me posted if the CRC Generation tool works for you, or you need more clarifications on the algorithm.


[September 29th, 2010] Users frequently ask why their implementation of serial CRC doesn’t match the generated parallel CRC with the same polynomial. There are a few reasons for that:
– bit order into serial CRC isn’t the same as how the data is fed into the parallel CRC
– input data bits are inverted
– LFSR is not initialized the same way. A lot of protocols initialize it with F-s, and that’s what is done in the parallel CRC.
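Each of these three causes is easy to reproduce with a small serial model. The sketch below is only an illustration: `crc5_serial` is a hypothetical helper for G(x) = x^5 + x^2 + 1, and the 4-bit test pattern is arbitrary. It computes a reference CRC with an all-ones initial value and then changes one setting at a time.

```python
def crc5_serial(bits, init=0x1F):
    """Serial CRC5 over an iterable of 0/1 bits, in the order they are given."""
    state = init
    for b in bits:
        fb = ((state >> 4) & 1) ^ b
        state = (state << 1) & 0x1F
        if fb:
            state ^= 0x05
    return state


data = [1, 0, 1, 1]  # arbitrary test pattern

reference = crc5_serial(data)                      # all-ones init, given order
reversed_order = crc5_serial(data[::-1])           # bits fed in opposite order
inverted_bits = crc5_serial([b ^ 1 for b in data]) # input bits inverted
zero_init = crc5_serial(data, init=0x00)           # LFSR initialized with zeros

for name, value in [("reference", reference), ("reversed order", reversed_order),
                    ("inverted bits", inverted_bits), ("zero init", zero_init)]:
    print(f"{name:15s} 0x{value:02x}")
```

For this pattern all four results are different, so a mismatch in any one of the settings is enough to make a correct serial implementation disagree with the parallel one.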






References

  1. CRC on Wikipedia
  2. G. Campobello, G. Patane, M. Russo, “Parallel CRC Realization”
  3. W. Lu, S. Wong, “A Fast CRC Update Implementation”






LFSR Counters

May 4th, 2009 37 comments


Most EE or CS graduates know, or at least have heard about, different types of hardware counters: prescaled, Johnson, ripple carry, linear feedback shift register (LFSR), and others.
The majority of logic designers use the first two types, because they’re simple to implement in Verilog or VHDL. However, for some applications LFSR counters offer a significant advantage in terms of logic utilization and maximum frequency.
The other day I ran into the Xilinx LFSR Counter core and decided to explore its advantages. I was so impressed with its area savings compared with regular counters that I decided to write an online tool that generates Verilog code for an LFSR counter of an arbitrary value.
This LFSR Counter Generator tool runs on a server. The time it takes to generate the code grows exponentially with the counter size in bits. It takes several seconds to generate a 20-bit counter, but bigger counters cause the server to time out with the current tool implementation.
I’m planning to tweak the implementation to be able to generate counters up to ~30 bits. More than that would take too long no matter what approach is taken.


Please post your comments about your experience with the tool, features you’d like to add, and the issues you’ve seen.


References:

  1. Peter Alfke, Efficient Shift Registers, LFSR Counters, and Long Pseudo-Random Sequence Generators, Xilinx application note Xapp052
  2. Maria George and Peter Alfke, Linear Feedback Shift Registers in Virtex Devices, Xilinx application note  Xapp210
  3. Xilinx Linear Feedback Shift Register (LFSR) Logic Core


Welcome to OutputLogic.com

May 2nd, 2009 7 comments


Juggling my full-time job as a logic designer and my personal life, I found some time to bootstrap OutputLogic.com. I wanted to share some of the tools and ideas I’ve accumulated over the years, and now is a good time to do so.

An important design choice that I’ve made was to have those tools entirely web-based. In my opinion, it’s so much more convenient and user-friendly than having lots of scattered applications, written in different languages, for different operating systems, and with inconsistent user interfaces.

It looks like the web browser is becoming the focal point of user interaction with the computer. More and more high-quality web applications are showing up and gaining widespread adoption. Just to mention a few: Gmail, Google Maps, Google Documents. Just a few years ago you’d use a standalone application for sending an email, finding directions, or creating a spreadsheet.

But writing a decent web-based application requires a set of very specialized skills that take time to acquire and master. It’s no longer just cranking out HTML code and peppering it with some JavaScript. One needs to be familiar with half a dozen scripting languages, several application frameworks, and databases. For lots of people it is a full-time job.
Nevertheless, another design choice was to develop everything myself. I knew it’d be quite a challenge, but that’s exactly what makes the process so fun.

I don’t have much experience doing web design, which means that implementing the things I want the way I want takes a lot more time than it should. Things like getting around JavaScript quirks, figuring out why CGI doesn’t work, debugging Perl scripts, reverse-engineering piles of PHP code, finding the right framework for the site and development tools to work with, and many others. That’s the disadvantage: the learning curve.

Sometimes it’s hard to implement a low-level feature with a high-level language, which is not designed for that task. On the other hand, it’s often so much easier to design a piece of user interface to be displayed in a browser with just a few lines of script rather than writing it almost from scratch in Java, C++, or whatever language is used for standalone applications.

In any case, that was a short introduction. Thanks for taking the time to read this post. Your valuable input to OutputLogic.com is greatly appreciated.