
Friday, December 20, 2019

FPGA


Implementation of random logic, Prototyping, Reconfigurable hardware, Hardware dedicated to solving one problem

•         DSP48 can perform mathematical functions such as:

•         Add/Subtract

•         Accumulate

•         Multiply

•         Multiply-Accumulate

•         Multiplexer

•         Barrel Shifter

•         Counter

•         Divide (multi-cycle)

•         Square Root (multi-cycle)

•         Can also create filters such as:

•         Serial FIR Filter (Xilinx calls these MACC filters)

•         Parallel FIR Filter

•         Semi-Parallel FIR Filter

•         Multi-rate FIR Filters

 

High-performance FPGA families 

 

  Virtex (220 nm)

     Virtex-E, Virtex-EM (180 nm)

     Virtex-II (130 nm)

   Virtex-II PRO (130 nm)

    Virtex-4 (90 nm)

     Virtex-5 (65 nm)

     Virtex-6 (40 nm)

    Virtex-7 (28 nm)

 

•      Low Cost Family

–    Spartan/XL – derived from XC4000

–    Spartan-II – derived from Virtex

–    Spartan-IIE – derived from Virtex-E

–     Spartan-3 (90 nm)

–     Spartan-3E (90 nm) – logic optimized

–     Spartan-3A (90 nm) – I/O optimized

–     Spartan-3AN (90 nm) – non-volatile

–     Spartan-3A DSP (90 nm) – DSP optimized

–     Spartan-6 (45 nm)

–     Artix-7 (28 nm)

 

 

•      Starting with the Virtex-4 family, Xilinx introduced the DSP48 block for high-speed DSP on FPGAs

•      Essentially a multiply-accumulate core with many other features

•      Now also in Spartan-3A DSP, Spartan-6, Virtex-5, and Virtex-6

 

Embedded memories

 

 

The M4K memory blocks support the following modes:

 

•   Single-port RAM (RAM:1-Port)

•   Simple dual-port RAM (RAM:2-Port)

•   True dual-port RAM (RAM:2-Port)

•   Tri-port RAM (RAM:3-Port)

•   Single-port ROM (ROM:1-Port)

•   Dual-port ROM (ROM:2-Port)

 

 

Change in Supply Voltages

 

Year    Technology (nm)    Core Supply Voltage (V)
1998    350                3.3
1999    250                2.5
2000    180                1.8
2001    150                1.5
2003    130                1.2
2008     65                1.0
2009     40                0.9
2011     28                0.9

 

 

 

Multiple Standards for High-Speed Serial Communication

 

Fibre Channel

InfiniBand

PCI Express (developed by Intel)

RapidIO

SkyRail (developed by MindSpeed Technologies)

10-gigabit Ethernet      

 

Field Programmable Gate Array


‘Simple’ Programmable Logic Blocks


Massive Fabric of Programmable Interconnects


Standard CMOS Integrated Circuit fabrication process as for memory chips (Moore’s Law)

 

         An FPGA contains a set of programmable logic gates and rich interconnect resources, making it possible to implement complex digital circuits

 

          To simplify the design of complex systems in FPGAs, there exist libraries of pre-defined complex functions and circuits (IP cores) that have been tested and optimized to speed up the design process.

 

         All done by the CAD system (e.g. Xilinx ISE or Vivado, Altera Quartus)

 

Chop up circuit into little pieces of logic

 

Each piece goes in a separate logic element (LE)

 

Hook them together with the programmable routing

 

                     Logic Functions implemented in Look Up Table LUTs.


            Flip-Flops. Registers. Clocked Storage elements.  


                   Multiplexers (select 1 of N inputs)

 

Huge Density of Logic Block ‘Islands’

 

1,000 … 100,000’s

 

                       in a ‘Sea’ of Interconnects

 

 

FPGA  Features

 

Large Complex Functions

Re-Programmability, Flexibility.

Massively Parallel Architecture

Processing many channels simultaneously (cf. microprocessor)

Fast Turnaround Designs, Standard IC Manufacturing Processes. Moore’s Law

Mass produced. Inexpensive. Many variants. Sizes. Features.

Not Radiation Hard

Power Hungry, No Analogue

 

LUT

 

LUT contains Memory Cells to implement small logic functions

Each cell holds ‘0’ or ‘1’ .

Programmed with outputs of Truth Table

Inputs select content of one of the cells as output

Larger Logic Functions built up by connecting many Logic Blocks together

Determined by SRAM cells
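
The lookup behaviour described above can be sketched in a few lines of Python. This is only a behavioural illustration, not vendor code: the names make_lut and lut_read are made up for this example, and treating the first input as the most significant address bit is an assumption.

```python
from itertools import product


def make_lut(func, n_inputs=4):
    """Build the memory cells of a LUT: one stored bit per truth-table row."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]


def lut_read(cells, *inputs):
    """The inputs form an address that selects one memory cell as the output."""
    address = 0
    for bit in inputs:
        # Assumption: the first input is the most significant address bit.
        address = (address << 1) | (bit & 1)
    return cells[address]


# Example function chosen for illustration: f = (a AND b) XOR (c OR d)
cells = make_lut(lambda a, b, c, d: (a & b) ^ (c | d))
```

Any other 4-input function fits in the same 16 cells; only the stored bits change.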

 

 

LUT AS DISTRIBUTED RAM

 

 

•      CLB LUT configurable as Distributed RAM

–     An LUT equals 16x1 RAM

–     Implements Single and  Dual-Ports

–     Cascade LUTs to increase RAM size

•      Synchronous write

•      Synchronous/Asynchronous read

–     Accompanying flip-flops used for synchronous read

•      When the CLB LUT is configured as memory, it can implement 16x1 synchronous RAM.  One LUT can implement 16x1 Single-Port RAM. Two LUTs are used to implement 16x1 dual port RAM.  The LUTs can be cascaded for desired memory depth and width.

•      The write operation is synchronous.  The read operation is asynchronous and can be made synchronous by using the accompanying flip flops of the CLB LUT. 

•      The distributed ram is compact and fast which makes it ideal for small ram based functions.
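
The write and read behaviour of a 16x1 distributed RAM can be sketched as a small Python model. The class and method names are hypothetical. The model assumes that the optional read register samples the cell content as it was just before the clock edge, which is how a flip-flop fed by the asynchronous LUT output would behave.

```python
class DistributedRam16x1:
    """Behavioural model of one LUT used as a 16x1 single-port RAM."""

    def __init__(self):
        self.cells = [0] * 16   # the 16 LUT memory cells
        self.read_reg = 0       # accompanying flip-flop for synchronous read

    def read_async(self, addr):
        """Asynchronous read: the output follows the address immediately."""
        return self.cells[addr & 0xF]

    def clock_edge(self, addr, data_in, write_enable):
        """Rising clock edge: register the read data, then perform the write."""
        # Assumption: the register captures the value present before the write.
        self.read_reg = self.cells[addr & 0xF]
        if write_enable:
            self.cells[addr & 0xF] = data_in & 1
        return self.read_reg
```

Depth or width is increased by instantiating several such LUTs and decoding the upper address bits, which is what cascading amounts to.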

 

 

LUT AS SHIFT REGISTER

 

•      Each LUT can be configured as shift register

–     Serial in, serial out

•      Dynamically addressable delay up to 16 cycles

•      For programmable pipeline

•      Cascade for greater cycle delays

•      Use CLB  flip-flops to add depth

The LUT can be configured as a shift register (serial in, serial out) with a length programmable from 1 to 16 bits.

For example, DEPTH[3:0] = 0010 (binary) means that the shift register is 3 bits long. In the simplest case, a 16-bit shift register can be implemented in a single LUT, eliminating the need for 16 flip-flops and the extra routing resources that would otherwise have lowered performance.
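
A rough Python model of the addressable shift register is shown here. The class name Srl16 and its methods are made up for illustration. The model assumes that a DEPTH value of N taps the bit that was shifted in N + 1 clock edges earlier, which matches the DEPTH = 0010 example giving a length of 3.

```python
class Srl16:
    """Behavioural model of a LUT configured as a 16-bit shift register."""

    def __init__(self):
        # bits[0] holds the most recently shifted-in bit.
        self.bits = [0] * 16

    def clock_edge(self, data_in):
        """Shift one position on each clock edge (serial in)."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, depth):
        """Serial out: the 4-bit DEPTH address selects the tap (delay = depth + 1)."""
        return self.bits[depth & 0xF]
```

Because the tap is selected by an address rather than by wiring, the delay can be changed while the design is running.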

 

                 FAST CARRY LOGIC

 

Each CLB contains separate logic and routing for the fast generation of sum & carry signals

–     Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters

Carry logic is independent of normal logic and routing resources

                                       All major synthesis tools can infer carry logic for arithmetic functions

–     Addition (SUM = A + B)

–     Subtraction (DIFF = A - B)

–     Comparators (if A < B then…)

–     Counters (count = count +1)
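
To show why a dedicated carry path helps, here is a bit-level Python sketch of an adder built the way a carry chain is usually described: a propagate signal per bit, a multiplexer that passes or regenerates the carry, and an XOR that forms the sum. The function name and the 8-bit default width are assumptions for illustration.

```python
def carry_chain_add(a, b, width=8, carry_in=0):
    """Add two unsigned values bit by bit along a carry chain."""
    carry = carry_in
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        propagate = a_i ^ b_i                  # computed by the logic in front of the chain
        sum_i = propagate ^ carry              # sum bit
        carry = carry if propagate else a_i    # carry multiplexer: pass or regenerate
        total |= sum_i << i
    return total, carry
```

Subtraction reuses the same chain by adding the bitwise complement of B with the carry input set to 1.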

 

CLB

 

 

Each slice contains two sets of the following:

–     Four-input LUT

•      Any 4-input logic function,

•      or 16-bit x 1 sync RAM (SLICEM only)

•      or 16-bit shift register (SLICEM only)

–     Carry & Control

•      Fast arithmetic logic

•      Multiplier logic

•      Multiplexer logic

–     Storage element

•      Latch or flip-flop

•      Set and reset

•      True or inverted inputs

•      Sync. or async. control

–     Each Virtex™-II CLB contains four slices

–     Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs

–     A switch matrix provides access to general routing resources

–     Each slice has four outputs

–     Two registered outputs, two non-registered outputs

–     Two BUFTs associated with each CLB, accessible by all 16 CLB outputs

–     Carry logic runs vertically, up only

–     Two independent carry chains per CLB

 

IOB

 

•      IOB provides interface between the package pins and CLBs

•      Each IOB can work as uni- or bi-directional I/O

•      Outputs can be forced into High Impedance

•      Inputs and outputs can be registered

–     advised for high-performance I/O

•      Inputs can be delayed

 

SELECTIO

•      Allows direct connections to external signals of varied voltages and thresholds

–     Optimizes the speed/noise tradeoff

–     Saves having to place interface components onto your board

•      Differential signaling standards

–     LVDS (Low Voltage Differential Signaling), BLVDS, ULVDS

–     LDT (Lightning Data Transport)

–     LVPECL (Low Voltage Positive Emitter Coupled Logic)

•      Single-ended I/O standards

–     LVTTL, LVCMOS (3.3V, 2.5V, 1.8V, and 1.5V)

–     PCI-X at 133 MHz, PCI (3.3V at 33 MHz and 66 MHz)

–     GTL, GTLP

–     and more!

MEMORY AND CLOCKING

 

•      Distributed RAM and Block RAM

–     Distributed RAM uses the CLB resources

–     Block RAM is a dedicated resource on the device  (18-kb blocks)

•      Dedicated 18 x 18 multipliers next to block RAMs

•      Clock management resources

–     16 dedicated global clock multiplexers

–     Digital Clock Managers (DCMs)

 

DISTRIBUTED SELECT RAM

 

•      Uses a LUT in a slice as memory

•      Synchronous write

•      Asynchronous read

–     Accompanying flip-flops can be used to create synchronous read

•      Emulated dual-port RAM

–     One read/write port

–     One read-only port

 

 

 

BRAM

 

•      The Block RAM is true dual port, which means it has 2 independent Read and Write ports and these ports can be read and/or written simultaneously, independent of each other.

•      All control logic is implemented within the RAM so no additional CLB logic is required to implement dual port configuration.

•      Most efficient memory implementation

•      Dedicated blocks of memory

•      Ideal for most memory requirements

•      4 to 104 memory blocks

•      18 kbits = 18,432 bits per block (16 k without parity bits)

•      Use multiple blocks for larger memories

•      Builds both single and true dual-port RAMs

•      Block SelectRAM™ resources are dedicated resources on the silicon.

•      RAMs can be given an initial value. Many “initialization” attributes are associated with the Block SelectRAM resources:

•      INIT_xx: Numbered attributes (00 - 3F) that specify the initial memory data contents.  Each INIT_xx attribute is a 64-digit hex number.

•      INITP_xx: Numbered attributes (00 - 07) that specify the initial memory parity contents.  Each INITP_xx attribute is a 64-digit hex number.

•      INIT_A/INIT_B: Specifies the initial value of the RAM output latches after configuration.

•      SRVAL_A/SRVAL_B: Specifies the value of the RAM output latches after SSRA/SSRB is asserted.

•      INIT and SRVAL attributes are specified as hex numbers.

•      Up to 3.5 Mb of RAM in 18-kb blocks

•      Synchronous read and write

•      True dual-port memory

•      Each port has synchronous read and write capability

•      Different clocks for each port

•      Supports initial values

•      Synchronous reset on output latches

•      Supports parity bits

•      One parity bit per eight data bits

 

DUAL PORTS

 

 

•      Each port can be configured with a different data bus width

•      Provides easy data width conversion without any additional logic

•      Because the RAM blocks are true dual port, each port can be configured for a different width. For example, port A can be configured as 1K x 4 and port B as 256 x 16. This feature is useful when two parts of a design access the same memory over buses of different widths.

•      Note that the Altera FLEX 10K and ACEX 1K families do not have this feature, as they do not have true dual port capability.
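
The width conversion can be pictured as two views of the same 4096 storage bits. The Python sketch below is an illustration only: the class name is hypothetical, and the bit ordering (word N of width W occupies bits N*W to N*W + W - 1, least significant bit first) is an assumption that should be checked against the device documentation.

```python
class MixedWidthRam:
    """One block of storage bits accessed through ports of different widths."""

    def __init__(self, total_bits=4096):
        self.bits = [0] * total_bits

    def write(self, addr, value, width):
        """Write one word of the given port width."""
        base = addr * width          # assumed mapping of word address to bit offset
        for i in range(width):
            self.bits[base + i] = (value >> i) & 1

    def read(self, addr, width):
        """Read one word of the given port width."""
        base = addr * width
        return sum(self.bits[base + i] << i for i in range(width))
```

Writing four 4-bit words through a 1K x 4 port and reading a single 16-bit word through a 256 x 16 port returns the four nibbles packed together, with no extra logic in between.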

 

 

 

ADDED ADVANTAGE OF TRUE DUAL PORT

 

 

•      Added advantage of True Dual-Port

–     No wasted RAM Bits

•      Can split a Dual-Port 16K RAM into two Single-Port 8K RAMs

–     Simultaneous independent access

•      To access the lower RAM

–     Tie the MSB address bit  to Logic Low

•      To access the upper RAM

–     Tie the MSB address bit  to Logic High

 

TWO INDEPENDENT SINGLE PORT

 

•      Here, a single 4K bit memory block is split into two independent 2K bit Single-Port blocks. This  feature allows efficient utilization of memory bits. The upper 2K bit block is accessed by tying the ADDR11 bit to Vcc whereas the lower 2K bit block is accessed by tying it to GND instead.
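
The address trick can be sketched in Python as follows. The class and method names are made up for this example, and the model assumes a 4096 x 1 organisation so that ADDR11 is the most significant address bit.

```python
class SplitBlockRam:
    """4K-bit dual-port block used as two independent 2K-bit single-port RAMs."""

    ADDR_BITS = 12  # assumed 4096 x 1 organisation, address bits ADDR11..ADDR0

    def __init__(self):
        self.cells = [0] * (1 << self.ADDR_BITS)

    def _full_address(self, addr, addr11):
        # The user drives ADDR10..ADDR0; ADDR11 is tied to GND (0) or Vcc (1).
        return (addr11 << (self.ADDR_BITS - 1)) | (addr & 0x7FF)

    def write_lower(self, addr, bit):
        self.cells[self._full_address(addr, 0)] = bit & 1

    def read_lower(self, addr):
        return self.cells[self._full_address(addr, 0)]

    def write_upper(self, addr, bit):
        self.cells[self._full_address(addr, 1)] = bit & 1

    def read_upper(self, addr):
        return self.cells[self._full_address(addr, 1)]
```

Since each half is reached through its own port, the two halves never see each other's addresses and no storage bits are left unused.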

 

 

DEDICATED MULTIPLIER BLOCKS

 

•      18-bit two’s complement signed operation

•      Optimized to implement Multiply and Accumulate functions

•      Multipliers are physically located next to Block SelectRAM™ memory

 

GLOBAL CLOCK ROUTING RESOURCES

 

•      Sixteen dedicated global clock multiplexers

•      Eight on the top-center of the die, eight on the bottom-center

•      Driven by a clock input pad, a DCM, or local routing

•      Global clock multiplexers provide the following:

•      Traditional clock buffer (BUFG) function

•      Global clock enable capability (BUFGCE)

•      Glitch-free switching between clock signals (BUFGMUX)

•      Up to eight clock nets can be used in each clock region of the device

•      Each device contains four or more clock regions

 

 

 

DCM

•      Up to twelve DCMs per device

–     Located on the top and bottom edges of the die

–     Driven by clock input pads

•      DCMs provide the following:

–     Delay-Locked Loop (DLL)

–     Digital Frequency Synthesizer (DFS)

–     Digital Phase Shifter (DPS)

•      Up to four outputs of each DCM can drive onto global clock buffers

•      All DCM outputs can drive general routing

 

 

CLOCKED LOGIC:

 

Registers on outputs. CLOCKED storage elements.

Synchronous FPGA Logic Design, Pipelined Logic.

FPGA Fabric Pulse from Global Clock (e.g. LHC BX frequency)

 

FPGA Classification

 

Based on Functional Unit/Logic Cell Structure

•   Transistor Pairs

•   Basic Logic Gates: NAND/NOR

•   MUX

•   Look-up Tables (LUT)

•   Wide-Fan-In AND-OR Gates

 

Programming Technology

•   Anti-Fuse Technology

•   SRAM Technology

•   EPROM Technology

•      Programmable switches vary from vendor to vendor. All share a common property: each is configurable in one of two positions, ‘ON’ or ‘OFF’

•      Can be classified into three categories:

–     SRAM based

–     Antifuse based

–     EPROM/EEPROM/Flash based

•      Desired properties:

•      Minimum area consumption

•      Low ON resistance; High OFF resistance

•      Low parasitic capacitance to the attached wire

•      Reliability in volume production

–     SRAM cells are used

•      As Look-Up Tables (LUT) to implement logic (as Truth Tables)

•      As embedded RAM blocks (for buffer storage etc.)

•      As control to routing and configuration switches

–      Advantages

•      Allows In-System Programming (ISP)

•      Suitable for Reconfigurable HW

–      Disadvantages

•      Volatile – needs power all the time, or must download its configuration data from a PROM

ANTIFUSE

•  Invented at Stanford and developed by Actel

•  Opposite to regular fuse technology: normally an open circuit until a programming current (about 5 mA) is forced through it

•  Two Types:

1)      Actel’s PLICE [Programmable Low-Impedance Circuit Element] - A High-Resistance Poly-Diffusion Antifuse

2)      QuickLogic’s Low-Resistance metal-metal antifuse [ViaLink] technology

–   Direct metal-to-metal connections

–   Higher programming currents reduce antifuse resistance

•  Disadvantages:

1)       Unwanted Long Delay


EPROM

 

 

•      EPROM Programming Technology

 

–     Two gates: Floating and Select

–     Normal mode:

•      No charge on floating gate

•      Transistor behaves as normal n-channel transistor

 

–     Floating gate charged by applying high voltage

•      Threshold of transistor (as seen by gate) increases

•      Transistor turned off permanently

–     Re-programmable by exposing to UV radiation

 

•      Not in-system re-programmable

•      Re-programming is a time consuming task

EEPROM

•      Two gates: Floating and Select

•      Functionally equivalent to EPROM; Construction and structure differ

•      Electrically Erasable: Re-programmable by applying high voltage

            (No UV radiation exposure!)

•       When un-programmed, the threshold (as seen by select gate) is negative!

                       

•      Re-programmable; In general, in-system re-programmable

•      Re-programming takes less time than with EPROM technology

•      Multiple voltage sources may be required

•      Area occupied is  twice that of EPROM!

 

 

 

All FPGAs contain the same basic resources

–     Slices (grouped into CLBs)

•      Contain combinatorial logic and register resources

–     IOBs

•      Interface between the FPGA and the outside world

–     Programmable interconnect

–     Other resources

•      Memory

•      Multipliers

•      Global clock buffers

•      Boundary scan logic

 

 

Configuring of FPGA

 

 

Millions of SRAM cells holding LUTs and Interconnect Routing

 

Volatile Memory. Lose configuration when board power is turned off.

 

Keep Bit Pattern describing the SRAM cells in non-Volatile Memory e.g. PROM 

 

Configuration takes on the order of seconds

 

FPGA Trends

 

 

 

State of Art is 40nm on 300 mm wafers

Top of range >500,000 Logic Blocks

>1,000 pins (Fine Pitched BGA)

Logic Block cost ~ 1$ in 1990

Today < 0.1 cent

 

Problems

 

Power. Leakage currents.

Design Gap

CAE Tool

 



FPGA RESET



In an FPGA design, a reset acts as a synchronization signal that sets all the storage elements to a known state. In a digital design, designers normally implement a global reset as an external pin to initialize the design on power-up. The global reset pin is similar to any other input pin and is often applied asynchronously to the FPGA.


Designers can then choose to use this signal to reset their design asynchronously or synchronously inside the FPGA. An optimal reset structure will enhance device utilization, timing and power consumption in an FPGA.


Devices in the Xilinx 7 series architecture contain eight registers per slice, and all these registers are D-type flip-flops. All of these flip-flops share a common control set. The control set of a flip-flop is the clock input (CLK), the active-high clock enable (CE) and the active-high SR port. The SR port in a flip-flop can serve as a synchronous set/reset or an asynchronous preset/clear port.

The RTL code that infers the flip-flop also infers the type of reset a flip-flop will use. The code will infer an asynchronous reset when the reset signal is present in the sensitivity list of an RTL process. The synthesis tool will infer a flip-flop with an SR port configured as a preset or clear port. When the SR port is asserted, the flip-flop output is immediately forced to the SRVAL attribute of the flip-flop. In the case of synchronous resets, the synthesis tool will infer a flip-flop whose SR port is configured as a set or reset port. In addition, you can initialize the flip-flop output to the value the INIT attribute specifies. The INIT value is loaded into the flip-flop during configuration and when the global set reset (GSR) signal is asserted.

The flip-flops in Xilinx FPGAs can support both asynchronous and synchronous reset and set controls. However, the underlying flip-flop can natively implement only one set / reset / preset / clear at a time. Coding for more than one set / reset / preset / clear condition in the RTL code will result in the implementation of one condition using the SR port of the flip-flop and the other conditions in fabric logic, thus using more FPGA resources.



 If one of the conditions is synchronous and the other is asynchronous, the asynchronous condition will be implemented using the SR port and the synchronous condition in fabric logic. In general, it’s best to avoid more than one set/reset/preset/clear condition. Furthermore, only one attribute for each group of four flip-flops in a slice determines if the SR ports of flip-flops are synchronous or asynchronous.


Regardless of the reset type used (synchronous or asynchronous), you will generally need to synchronize the reset with the clock. As long as the duration of the global reset pulse is long enough, all the device flip-flops will enter the reset state. However, the deassertion of the reset signal must satisfy the timing requirements of the flip-flops to ensure that the flip-flops transition cleanly from their reset state to their normal state. Failure to meet this requirement can result in flip-flops entering a metastable state. Furthermore, for correct operation of some subsystems, like state machines and counters, all flip-flops must come out of reset on the same clock edge. If different bits of the same state machine come out of reset on different clocks, the state machine may transition into an illegal state. This reinforces the need to make the deassertion of reset synchronous to the clock.



 For designs that use a synchronous reset methodology for a given clock domain, it is sufficient to use a standard metastability resolution circuit (two back-to-back flip-flops) to synchronize the global reset pin onto a particular clock domain. This synchronized reset signal can then initialize all storage elements in the clock domain by using the synchronous SR port on the flip-flops. Because both the synchronizer and the flip-flops to be reset are on the same clock domain, the standard PERIOD constraint of the clock covers the timing of the paths between them. Each clock domain in the device needs to use a separate synchronizer to generate a synchronized version of the global reset for that clock domain.
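
The two-flip-flop synchronizer described above can be sketched as a cycle-based Python model. This is only a behavioural illustration with hypothetical names: it assumes both flip-flops start in the reset-asserted state, and it cannot represent metastability itself, only the two-stage structure that gives a metastable first stage time to settle.

```python
class ResetSynchronizer:
    """Two back-to-back flip-flops that bring a reset pin into one clock domain."""

    def __init__(self):
        # Assumption: both stages initialise to 1 (reset asserted).
        self.ff1 = 1
        self.ff2 = 1

    def clock_edge(self, reset_pin):
        """Rising clock edge: sample the pin into stage 1, move stage 1 to stage 2."""
        self.ff2, self.ff1 = self.ff1, reset_pin & 1
        return self.ff2  # synchronized reset for this clock domain


def synchronize(samples):
    """Run a sequence of pin values (one per clock edge) through a synchronizer."""
    sync = ResetSynchronizer()
    return [sync.clock_edge(value) for value in samples]
```

Calling synchronize([1, 1, 0, 0, 0]) shows the synchronized reset staying asserted for one more clock edge after the pin is released, so every flip-flop in the domain sees the release on the same edge. A design with several clock domains would use one such instance per domain.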



 Sometimes a portion of a design is not guaranteed to have a valid clock. This can occur in systems that use recovered clocks or clocks that are sourced by a hot-pluggable module. In such cases, the storage elements in the design may need to be initialized with an asynchronous reset using the asynchronous SR port on the flip-flops. Even though the storage elements use an asynchronous SR port, the deasserting edge of the reset must still be synchronous to the clock. This requirement is characterized by the reset-recovery timing arc of the flip-flops, which is similar to a setup requirement of the deasserting edge of an asynchronous SR to the rising edge of the clock. Failure to meet this timing arc can cause flip-flops to enter a metastable state and synchronous subsystems to enter unwanted states.



Description:

 

  Generate a new clock object from an existing physical clock object in the design.

  Clocks can be added to a design in one of three ways:

   *  Primary physical or virtual clocks defined with the create_clock command.

   *  Derived clocks defined with the create_generated_clock command generated from a primary physical clock.

   *  Derived clocks automatically generated by the Vivado Design Suite when a clock propagates to an MMCM/PLL/BUFR.

  You can also use the create_generated_clock command to change the name of clocks that the Vivado tool has auto-derived from an MMCM/PLL/BUFR. In this case, a new clock is not created, but an existing clock defined on the specified source object is renamed to the provided name. This requires -name and <object> to be specified, and supports the use of -source and/or -master_clock to further identify the clock to rename when multiple clocks exist on the source object. Refer to the Vivado Design Suite User Guide: Using Constraints (UG903) for more information on renaming auto-derived clocks.

  Note: You cannot rename a clock that is already in use by other constraints at the time of renaming. You must rename the clock prior to any other appearance or use of the clock in an XDC file.

  This command returns the name of the clock object that is created, or returns an error if it fails.

 

Arguments:

  -name <arg> - (Optional) The name of the generated clock to create on the specified object, or the name to assign to an existing clock on the specified object. If no name is specified, the generated clock will be given the name of the <object> it is assigned to. If assigned to multiple <objects>, the name will be the first object in the list.

  -source <arg> - (Optional) The pin or port of the master clock from which to derive the generated clock. The master clock must be a previously defined physical clock, not a virtual clock; but can be a primary clock or another generated clock. If the source pin or port currently has multiple clocks defined, the -master_clock option must be used to identify which clock on the source is to be used to define the generated clock.

  -edges <arg> - (Optional) Specifies the edges of the master clock to use in defining transitions on the generated clock. Specify transitions on the generated clock in a sequence of 1, 2, 3, by referencing the appropriate edge count from the master clock in numerical order, counting from the first edge. The sequence of transitions on the generated clock defines the period and duty cycle of the clock: position 1 is the first rising edge of the generated clock, position 2 is the first falling edge of the generated clock and so defines the duty cycle, position 3 is the second rising edge of the generated clock and so defines the clock period. Enclose multiple edge numbers in braces {}. See the example below for specifying edge numbers.

  -divide_by <arg> - (Optional) Divide the frequency of the master clock by the specified value to establish the frequency of the generated clock object. The value specified must be >= 1, and must be specified as an integer.

  -multiply_by <arg> - (Optional) Multiply the frequency of the master clock by the specified value to establish the frequency of the generated clock object. The value specified must be >= 1, and must be specified as an integer.

  -combinational - (Optional) Define a combinational path to create a "-divide_by 1" generated clock.

  -duty_cycle <arg> - (Optional) The duty cycle of the generated clock defined as a percentage of the new clock period when used with the -multiply_by argument. The value is specified as a percentage from 0.0 to 100.

  -invert - (Optional) Create a generated clock with the phase inverted from the master clock.

  -edge_shift <arg> - (Optional) Shift the edges of the generated clock by the specified values relative to the master clock. See the example below for specifying edge shift.

  -add - (Optional) Add the generated clock object to an existing clock group specified by <objects>.

  Note: -master_clock and -name options must be specified with -add.

  -master_clock <arg> - (Optional) If there are multiple clocks found on the source pin or port, the specified clock object is the one to use as the master for the generated clock object.

  Note: -add and -name options must be specified with -master_clock.

  -quiet - (Optional) Execute the command quietly, returning no messages from the command. The command also returns TCL_OK regardless of any errors encountered during execution.

  Note: Any errors encountered on the command-line, while launching the command, will be returned. Only errors occurring inside the command will be trapped.

  -verbose - (Optional) Temporarily override any message limits and return all messages from this command.

  Note: Message limits can be defined with the set_msg_config command.

  <objects> - (Required) The pin or port objects to which the generated clock should be assigned. If the specified objects already have a clock defined, use the -add option to add the new generated clock and not overwrite any existing clocks on the object.

 

Examples:

 

  The following example defines a generated clock that is divided from the master clock found on the specified CLK pin. Since -name is not specified, the generated clock is assigned the same name as the pin it is assigned to:

 

    create_generated_clock -divide_by 2 -source [get_pins clkgen/sysClk] fftEngine/clk

 

  The following example defines a generated clock named CLK1 from the specified source clock, specifying the edges of the master clock to use as transition points for the generated clock, with edges shifted by the specified amount. In this example, the -edges option indicates that the second edge of the source clock is the first rising edge of the generated clock, the third edge of the source clock is the first falling edge of the generated clock, and the eighth edge of the source clock is the second rising edge of the generated clock. These values determine the period of the generated clock as the time from edge 2 to edge 8 of the source clock, and the duty cycle as the percentage of the period between edge 2 and edge 3 of the source clock. In addition, each edge of the generated clock is shifted by the specified amount:

    create_generated_clock -name CLK1 -source CMB/CLKIN -edges {2 3 8} \
    -edge_shift {0 -1.0 -2.0} CMB/CLKOUT

 

  Note: The waveform pattern of the generated clock is repeated based on the transitions defined by the -edges argument.
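
  The arithmetic behind -edges and -edge_shift can be checked with a short Python sketch. This is not a Vivado API; the function name is made up, and the model assumes a master clock with a 50% duty cycle whose first rising edge (edge 1) is at time 0. The 10 ns master period used below is also an assumption, since the example above does not state one.

```python
def generated_clock_from_edges(master_period, edges, edge_shift=(0.0, 0.0, 0.0)):
    """Compute the generated clock waveform from three master-clock edge numbers."""
    half_period = master_period / 2.0
    # Master edge n (1-based) occurs at (n - 1) half periods, then the shift is applied.
    rise1, fall1, rise2 = (
        (edge - 1) * half_period + shift for edge, shift in zip(edges, edge_shift)
    )
    period = rise2 - rise1
    duty_cycle = 100.0 * (fall1 - rise1) / period
    return {"waveform": (rise1, fall1), "period": period, "duty_cycle": duty_cycle}


# Assumed 10 ns master clock with the edge numbers and shifts from the example above.
clk1 = generated_clock_from_edges(10.0, (2, 3, 8), (0.0, -1.0, -2.0))
```

  With the assumed 10 ns master clock, the generated clock rises at 5 ns, falls at 9 ns and has a period of 28 ns.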

 

  This example creates two generated clocks from the output of a MUX, using -master_clock to identify which clock to use, using -add to assign the generated clocks to the Q pin of a flip flop, and using -name to define a name for the generated clock, since the object it is assigned to has multiple clocks assigned:

    create_generated_clock -source [get_pins muxOut] -master_clock M_CLKA \
    -divide_by 2 -add -name gen_CLKA [get_pins flop_Q]
    create_generated_clock -source [get_pins muxOut] -master_clock M_CLKB \
    -divide_by 2 -add -name gen_CLKB [get_pins flop_Q]

  The following example renames the automatically named clock that is derived by the Vivado Design Suite on the MMCM clock output:

    create_generated_clock -name CLK_DIV2 [get_pins mmcm/CLKOUT1]

 

See Also:

 

   *  check_timing

   *  create_clock

   *  get_generated_clocks

   *  get_pins

   *  report_clock

   *  set_clock_latency

   *  set_clock_uncertainty

   *  set_propagated_clock

 

Generated clocks are driven inside the design by special cells called Clock Modifying Blocks (for example, an MMCM), or by some user logic.

The XDC command "create_generated_clock" is used to create a generated clock object.

Syntax:

create_generated_clock  [-name <arg>] [-source <args>] [-edges <args>]
                        [-divide_by <arg>] [-multiply_by <arg>]
                        [-combinational] [-duty_cycle <arg>] [-invert]
                        [-edge_shift <args>] [-add] [-master_clock <arg>]
                        [-quiet] [-verbose] <objects>

 

This article discusses the common use cases of creating a generated clock.

For more information on create_generated_clock, please refer to (UG903).

Solution

Generated clocks are associated with a master clock from which they are derived.

The master clock can be a primary clock or another generated clock.

Please ensure you define all primary clocks first.

They are required for defining the generated clocks.

 

Use Case 1: Automatically Derived Clocks

 

For Clock Modifying Blocks (CMB) such as MMCMx, PLLx, IBUFDS_GTE2, BUFR and PHASER_x primitives, you do not need to manually create the generated clocks.

Vivado automatically creates these clocks, provided the associated master clock has already been defined.

You only need to create the primary clock that is feeding into the CMB.

The auto-generated clock names can be reported by the report_clocks command in the synthesized or implemented design so that you can use them in other commands or constraints.

 

It is possible to force the name of the generated clock that is automatically created by the tool.

See "Use Case 2: Renaming Auto-derived Clocks" below.

 

An auto-generated clock is not created if a user-defined clock (primary or generated) is also defined on the same netlist object, that is, on the same definition point (net or pin).

Vivado gives the following warning message when an existing primary or generated clock prevents auto-generated clock propagation:

 

Warning:[Timing 38-3] User defined clock exists on pin <pin_name> and will prevent any subsequent automatic derivation.

 

Automatically Derived Clock Example



The following automatically derived clock example is a clock generated by an MMCM.








XDC constraint:

create_clock -name clkin -period 10.000 [get_ports clkin]

 

The report_clocks command prints the following information:

 

Clock Period Waveform Attributes Sources

clkin 10.00000 {0.00000 5.00000} P {clkin}

cpuClk 10.00000 {0.00000 5.00000} P,G {clkip/mmcm0/CLKOUT}
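To reuse the auto-derived clock in other constraints, its name can be looked up from the source pin instead of being typed by hand. A minimal sketch, assuming the MMCM output pin clkip/mmcm0/CLKOUT from the report above; the variable name cpu_clk is only illustrative:

```tcl
# Look up the auto-derived clock object from its source pin
# (pin name taken from the report_clocks output above).
set cpu_clk [get_clocks -of_objects [get_pins clkip/mmcm0/CLKOUT]]

# Print the name Vivado assigned to it, for use in later constraints.
puts [get_property NAME $cpu_clk]
```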

Use Case 2: Renaming Auto-derived Clocks

It is possible to force the name of the generated clock that is automatically created by the tool.

The renaming process consists of calling the create_generated_clock command with a limited number of parameters.

 

create_generated_clock -name new_name [-source source_pin] [-master_clock master_clk] source_object

              

A single create_generated_clock command has to specify a unique auto-derived clock to rename.

A user-defined generated clock cannot be renamed.

 

 

Renaming Auto-derived Clock Example

 

Using the same example as in Use Case 1:

XDC constraint:

create_clock -name clkin -period 10.000 [get_ports clkin]
#renaming auto-derived clock cpuClk
create_generated_clock -name user_clk [get_pins clkip/mmcm0/CLKOUT]

Then the report_clocks command prints the following information:

Clock Period Waveform Attributes Sources

clkin 10.00000 {0.00000 5.00000} P {clkin}

user_clk 10.00000 {0.00000 5.00000} P,G {clkip/mmcm0/CLKOUT}

......

Use Case 3: User Defined Generated Clocks

When no automatic generation occurs, you will need to manually create clock modifications.

For example, for a clock divider logic that consists of LUTs and FFs, Vivado is not aware of the period relationship between the source clock and the divided clock. 

As a result, a user-defined generated clock is required for the divided clock.

This type of clock divider is not recommended in an FPGA. We recommend using an MMCM or a PLL to divide the clock.

Specify the master source using the -source option.
This indicates a pin or port in the design through which the master clock propagates.
It is common to use the master clock source point or the input clock pin of a generated clock source cell.

 

User Defined Generated Clock Example

The primary clock drives a register divider to create a divide-by-2 clock at the register output.

Two equivalent constraints are provided below:

create_clock -name clkin -period 10 [get_ports clkin]

# Option 1: master clock source is the primary clock source point with a 'divide by' value of the circuit.

create_generated_clock -name clkdiv2 -source [get_ports clkin] -divide_by 2 [get_pins REGA/Q]

# Option 2: master clock source is the REGA clock pin with a 'divide by' value of the circuit.

create_generated_clock -name clkdiv2 -source [get_pins REGA/C] -divide_by 2 [get_pins REGA/Q]
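The same divide-by-2 relationship can also be written with the -edges option from the syntax listing, which names the master-clock edges that produce the rise, fall and next rise of the generated clock. A sketch of a third equivalent form, reusing the REGA names from the example (only one of the definitions should be kept in a real constraint file):

```tcl
# Option 3 (illustrative): master clock edges 1, 3 and 5 become the
# rising, falling and next rising edge of the divided clock.
# With the 10 ns clkin this gives edges at 0, 10 and 20 ns, i.e. a 20 ns period.
create_generated_clock -name clkdiv2 -source [get_pins REGA/C] -edges {1 3 5} [get_pins REGA/Q]
```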

 

Use Case 4: Forwarded Clock through ODDR

 

In the Source Synchronous application, the clock is regenerated in the source device and forwarded to the destination device along with data.

A common method is to use clock forwarding via a double data-rate register.

In the following example, the ODDR instance in the source device is used to generate the forwarding clock for the Source Synchronous interface.

A user-defined generated clock needs to be created for the forwarding clock in order to be used in the set_output_delay constraint for the Source Synchronous interface.

Example of Creating Generated Clock at Clock Output Port:


create_generated_clock -name fwd_clk -multiply_by 1 -source [get_pins ODDR_inst/C] [get_ports CLKOUT]


The generated clock can then be referenced in the set_output_delay command.

For more information on set_output_delay command, please refer to (UG903).
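As a rough sketch of how the forwarded clock might then be used, the constraints below reference fwd_clk for a data output. The port name DATAOUT and the delay values are hypothetical placeholders; real values come from the receiving device's setup and hold requirements and the board trace delays:

```tcl
# Hypothetical data port and delay values, for illustration only.
# -max feeds the setup analysis, -min feeds the hold analysis.
set_output_delay -clock fwd_clk -max 1.500 [get_ports DATAOUT]
set_output_delay -clock fwd_clk -min 0.300 [get_ports DATAOUT]
```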

 

Use Case 5: Overlapping Clocks Driven by a Clock Multiplexer

When two or more clocks drive into a multiplexer (or more generally a combinatorial cell), they all propagate through and become overlapped on the fanout of the cell.

For this reason, you must review the CDC paths and add new constraints to exclude false paths due to overlapping.

The correct constraints are dictated by how and where the clocks interact in the design.

In some scenarios, user-defined generated clocks need to be created for the multiplexed clock in order to correctly constrain the CDC paths.

 

Multiplexed Clock Example:


If clk0 and clk1 only interact in the fanout of the multiplexer (FDM0 and FDM1), (i.e. the paths A, B and C do not exist), it is safe to apply the clock groups constraint to clk0 and clk1 directly.

set_clock_groups -logically_exclusive -group clk0 -group clk1


If clk0 and/or clk1 directly interact with the multiplexed clock (i.e. the paths A or B or C exist), then in order to keep timing for paths A, B and C, the constraint cannot be applied to clk0 and clk1 directly.


Instead, it must be applied to the portion of the clocks in the fanout of the multiplexer, which requires additional clock definitions.

In this case, two generated clocks are created at the Multiplexer output pin and paths crossing the generated clock domains are ignored.

create_generated_clock -name clk0mux -divide_by 1 -source [get_pins mux/I0] [get_pins mux/O]

create_generated_clock -name clk1mux -divide_by 1 -add -master_clock clk1 -source [get_pins mux/I1] [get_pins mux/O]

set_clock_groups -physically_exclusive -group clk0mux -group clk1mux

I have a design consisting of a Clock Wizard IP (MMCM) with a 100 MHz input clock at "clk_in1". I generate a 50 MHz clock at output "clk_out1".
As per UG903 (pages 88 and 89), Xilinx automatically derives "create_generated_clock" constraints for the clocks generated using PLL, MMCM, etc.


But when I checked my design, I could not see any "create_generated_clock" constraints defined automatically for "clk_out1".

You only need to create_clock for the input port of MMCM, then the output clock of MMCM will be automatically generated. You don't need to create_generated_clock on the output of MMCM manually.

You can check the result of report_clocks to see the auto-generated clocks.

Ex: clk_pin_p is the input clock for MMCM, clk_rx_clk_core/clk_tx_clk_core is the output of MMCM. 
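A minimal sketch of what the answer describes, assuming the 100 MHz clock enters the design on a top-level port named clk_in1 (the name is taken from the question; a real design may use a different port name):

```tcl
# 100 MHz input clock -> 10 ns period
create_clock -name clk_in1 -period 10.000 [get_ports clk_in1]

# In the synthesized or implemented design, list all clocks.
# The MMCM output should be listed as a generated clock (attribute G)
# with a 20 ns period for the 50 MHz output.
report_clocks
```

The auto-derived clock shows up in the report_clocks output; it is not expected to appear as a create_generated_clock line in the constraint files.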

The following example shows the use of multiple through points to define both a specific path (through state_reg1) and alternate paths (through count_3 or count_4), and writes the timing results to the specified file:

    report_timing -from go -through {state_reg1} -through { count_3  count_4 } \
    -to done -path_type summary -file C:/Data/timing1.txt

 

report_timing -from [get_pins clk_in_IBUF_BUFG_inst/ clk_in_IBUF_inst] -to [get_pins {freq_cnt_reg[24]/R}]

 




FPGA  Design Flow

 

During the micro-architecture or detailed design phase, FPGA resource requirements shall be estimated. Module designers shall have a “detailed view” of the design down to the function/major-component level for near-accurate estimates. At the end of this phase, the exact FPGA part to be used shall be finalized from the chosen family.

Following are critical aspects that need to be considered during this phase:

1.      FPGA device Architecture: Detailed investigation and understanding of FPGA device architecture/capabilities including logic cells, RAMs, multipliers, DLL/PLL and IOs

2.      Module boundaries: All module interfaces shall be on register boundaries.

3.      Internal bus structure: A well-defined internal point-to-point bus structure is preferred to routing all signals back and forth.

4.      Clocks: Clock multiplexing and gating shall be avoided and, if required, shall be done based on device capabilities.

5.      Resets: The number of resets in the system shall be optimized based on the dedicated reset routing resources available.

6.      Register file: Instead of creating one common register file and routing register values to all modules, it is better to have registers wherever they are used. If needed, registers may even be duplicated. Note that although the write path may be a multi-cycle path, the read path may not be. Registers shall also be implemented in RAM wherever possible.

7.      Selection of memories/multipliers: The memory size requirement shall decide whether to use hard-macros or to build with logic. Small memories should not be mapped to large memory hard-macros, even though building them from logic takes additional logic resources. The primary reason is that hard-macro memory locations are fixed, and placing the driving/receiving logic next to the memories is not always possible. Similarly, it is not advantageous to map a small multiplier (such as 3x3) to an 18x18 hard-macro multiplier.

8.      Data/Control path mixing: Often it is advantageous to store control signals along with data bits in memories and pass them on to other modules. For example, consider 16 data bits and 2 control bits to be transferred from one module to another through memory. These 18 bits can be stored as data bits in an available block memory of size, say, 1kx18. This method is even more advantageous if the hand-shake is asynchronous.

9.      Big multiplexer structures: It is not preferred to build very big multiplexer structures (say 256:1) especially for timing critical paths. Instead smaller multiplexers can be built, which are more controllable.

10.  High-level Floorplan: A high-level floorplan including IO planning shall be worked out (as shown in Figure 1) based on the gate count and other macro estimates. Spare area shall also be planned for future/field upgrades. At this stage it is not necessary to fix the IO locations, but it is necessary to fix the IO banks in the FPGA. Once the high-level floorplan is done, the budgeted area shall be made known to the module level designers. The floorplan locations of interfacing modules shall also be made known to them, which will enable them to further floorplan their allocated area if necessary. Some of the high-level floorplanning considerations are:

a.       Controlling congestion along with proximity

b.      Draw the data flow diagram of the design with the memories that are used to terminate the data paths and do module level area allocation

c.       Interdependent modules should be closer

d.      Module level area allocated shall be close to Macros which it is interfacing to

e.       Free area (rows and columns) between module area allocations, which will aid in inter module routing in full chip

f.       Clock resources and routing limitations if any

11.  Module output replication: Based on the initial floorplan each module output might have to be replicated if modules receiving this data are located in different corners of the chip.

12.  Best practices: RTL coding guidelines shall be passed on to module level designers.


RTL coding phase

Following are critical aspects which need to be considered during RTL coding phase:

1.      Logic delay: Though it may be adequate to keep logic delay at around 50% of the cycle time, it is desirable to keep it lower for high-speed paths in the design, say 20-30%. Usually there are abundant resources such as flip-flops (normally one flip-flop for each look-up table), RAMs, multipliers, etc. Wherever it doesn’t affect throughput, additional pipeline stages can be introduced judiciously, keeping routing congestion issues in mind.

2.      Device mapping efficiency: The RTL code shall enable the best FPGA mapping by exploring the device architecture. One example: in the Xilinx Virtex-II FPGA there is an additional 2:1 MUX (F5) between two LUTs with dedicated routes. If a 4:1 MUX is coded as a single entity, it maps well into one slice with two LUTs and an F5 MUX. If instead the 4:1 MUX is built with pipelining after the 2:1 MUXes, it can’t be mapped to the F5 MUX and an additional slice is needed. Another example: a long register-based shift register can be mapped to the SRL configuration of a LUT, provided the registers do not need a reset.

3.      Fan-out: Though synthesis tools can do automatic fan-out control, manual control is needed, especially for signals interfacing to hard-macros, as tools treat everything in the same manner and hard-macros are often black-boxes.

4.      Vendor specific structures and instantiations: Create hierarchy around them to give freedom to migrate from one technology to another.

5.      Macro interface: All the inputs/outputs of macros shall be registered due to their fixed locations.

6.      Gated clocks: Avoid gated clocks and use clock enables instead.

7.      Critical logic: Place critical logic in separate hierarchy

8.      Critical paths: Make sure that they are not crossing hierarchy of the block by registering all the outputs.

9.      Tri-state buffers: For low speed paths, it is desirable to use tri-state buffers to save logic cells

10.  Unused hard-macros: Unused RAMs can be used as register set or to map state machines coded as look up tables. This will also avoid large multiplexers in the read path. Also unused multipliers can be used as long shifters.

11.  False and multi-cycle paths: False and multicycle paths shall not be pipelined; they shall be identified during design and passed on to the synthesis tool.

12.  Trial synthesis and P&R: Each module level designer shall perform individual module level synthesis and P&R of the design with the given floorplan and optimize the RTL code while it is being developed. If the IO requirement of a module exceeds the device's physical IOs, dummy logic can be added to demultiplex/multiplex few-pins-to-more-pins and/or more-pins-to-few-pins using shift register structures and/or OR-gate structures as shown in Figure 2. Also, as shown in this figure, insert additional flip-flops on the interfaces from the selected module to other modules while leaving the actual IO interfaces unchanged. This will eliminate skewed timing results due to dummy logic and connections. Black-box timing information shall also be used during synthesis to avoid skewed timing results.

13.  Module level Floorplanning: Within the given floorplan area, it is often desirable to do sub-module level floorplanning. In this sub-module level floorplanning it is often necessary to floorplan only the critical parts of the design. It is also necessary to do an individual synthesis compile of the timing critical sub-modules being floorplanned, which prevents hierarchy loss (as shown in Figure 3) and thereby inefficient placement.

14.  Logic compression: Though from area standpoint it is preferred to do maximum level packing of unrelated logic (for example using COMPRESSION with Xilinx flow), it will have adverse impact on timing. Thus unrelated logic packing level shall be set based on timing criticality of each sub-module.

15.  IO allocation: The respective module IO fixing shall be done based on IO ring pin sequence on the die rather than pin sequence on the package.

Chip level Synthesis phase

During the chip level synthesis phase, following information shall be collected from individual module designers:


1.      Area constraints with unrelated logic compression information


2.      Timing constraints including false and multicycle paths


3.      IO assignments



4.      Black-box timing information


5.      Synthesis compile hierarchy


6.      Timing critical sub-module information


Module level synthesis has to be carried out with the information gathered from designers. Merely meeting the target frequency at the synthesis stage is not good enough, as route estimates are inaccurate. Instead, if the logic delay achieved is 50% of the cycle time, the best possible results have been obtained from synthesis and the flow can move on to further steps.

The resource sharing and fan-out control options in the synthesis tool can be enabled for non-timing-critical sub-modules, whereas options such as register replication, fan-out control and retiming can be enabled for timing critical sub-modules. Thus in the chip top level synthesis compilation, all modules will be black-boxes. Automated push-button physical synthesis has yielded only 10-15% overall improvement in performance after P&R. However, there are physical synthesis tools (e.g. Synplify Premier) which support floorplanning at the synthesis stage. The methodology described in this paper is equally applicable to netlist based floorplanning or physical synthesis based design floorplanning.





VIVADO SUITE


Reduced design iterations

Advanced analysis and reporting at every design step
Shared design info throughout the flow to converge faster
Interactive cross-probing accelerates debug
Advanced P&R, 4x faster than competing solutions
Superior QoR, 1 speed grade over ISE


Vivado Provides Consistent Runtime Advantage with Greater Predictability and Higher Scalability

Up to 4x faster than alternative solutions
Vivado supports large designs where competition fails

Less variability with design size, giving more predictable run-times

Efficient memory utilization
– ½ the memory requirements
Only Vivado can route design with:
– High routing Complexity
– High utilization

Runtime advantage increases with:
– Routing Complexity
– Device Utilization

QoR Advantage

Up to 3 speed grades vs. competitor’s 28nm solution

Team design or module reuse
– Implement modules out of context from top level design
– Iterate without overhead of the full design

Partial reconfiguration
– Provides system flexibility
– Reduces cost, size and total power

IP-XACT Packaged modules for reuse in new designs
– Reuse pre-verified placed & routed modules
– Accelerates design re-use and leverages your IP



Rapid ECO changes
– Incremental Implementation, 3x faster than recompile
– Post P&R edits with Vivado Device Editor

Vivado HLS - untimed algorithm to timed RTL

– Untimed (functional): C, C++, SystemC
– Timed (architectural): VHDL, Verilog, SystemC

Ideal for…
– Accelerating verification and implementation
• 10,000x faster simulation than RTL
• Explore multiple RTL architecture from the same C/C++
– Embedded designers interested in HW acceleration
– Best for applications with parallel processing
System Generator
– Implement DSP algorithms in a fraction of time of RTL
– Automatic RTL generation from Simulink® and MATLAB®
Comprehensive integration
– Supports AXI-4 and automatically packages the RTL for Vivado’s IP catalog

Rapid correct by construction IP integration
– Interface level connections speeds assembly
– IP parameter changes are propagated to ensure correct behavior
– Full and arbitrary hierarchy supports reuse and readability
– Automated generation of RTL for processor-based or non-processor based designs

Rapid verification and easy reuse

– Integrated with Vivado Logic Analyzer and Vivado Simulator for debug and analysis
– Integrated with Vivado IP Packager to build, share and reuse your own IP subsystem

Tightly integrated into Vivado IDE

– Easy cross probing between sources and waveforms and reports

Accelerated verification
– 3X faster than Xilinx ISE Simulator
– 2X less memory
– Accelerate verification even more with HW co-simulation



Synthesis:
– Superior SystemVerilog support
– 3x faster runtime
– 15x with “quick-synthesis” option
– Easily mark nets for hardware debug with Vivado Analyzer


Timing  Analysis

 

The Vivado IDE synthesis engine transforms the RTL description into a technology-mapped netlist.

At the synthesis stage, net delay modelling is not very accurate; the synthesized netlist should either meet timing or fail by only a small amount before implementation is started.

There are three categories of constraints for synthesis:

RTL Attributes

- directives written in the RTL files (MARK_DEBUG, etc.)

Timing Constraints (XDC)


The following timing constraints have a real impact on synthesis:

create_clock
create_generated_clock
set_input_delay
set_output_delay
set_clock_groups
set_false_path
set_max_delay
set_multicycle_path


TIMING ANALYSIS

 Timing Assertions Section

 Primary clocks

Virtual clocks

Generated clocks

Clock Groups

Input and output delay constraints

 

Timing Exceptions Section

 

False Paths

Max Delay / Min Delay

Multicycle Paths

Case Analysis

Disable Timing

 

Physical Constraints Section: can be located anywhere in the file, preferably before or after the timing constraints, or stored in a separate constraint file.

 Start with the clock definitions.

 The clocks must be created before they can be used by any subsequent constraints.

 Any reference to a clock before it has been declared results in an error and the corresponding constraint is ignored. This is true within an individual constraint file, as well as across all the XDC files (or Tcl scripts) in your design.

 The order of the constraint files matters. You must be sure that the constraints in each file do not rely on the constraints of another file.

 If this is the case, you must read the file that contains the constraint dependencies last.

 If two constraint files have interdependencies, you must either:

 • Merge them manually into one file that contains the proper sequence, or

 • Divide the files into several separate files, and order them correctly.
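The recommended ordering can be sketched as an XDC file skeleton. All clock, port and cell names below (sys_clk, virt_clk, div_reg, din, dout, rst_n) and all numeric values are hypothetical and only show where each kind of constraint would go:

```tcl
## Timing Assertions Section
# Primary clocks (must come first; later constraints refer to them)
create_clock -name sys_clk -period 10.000 [get_ports sys_clk]
# Virtual clocks (no source object)
create_clock -name virt_clk -period 10.000
# Generated clocks (need their master clock to exist already)
create_generated_clock -name div2_clk -source [get_ports sys_clk] -divide_by 2 [get_pins div_reg/Q]
# Clock Groups
# Input and output delay constraints
set_input_delay -clock virt_clk -max 2.000 [get_ports din]
set_output_delay -clock virt_clk -max 2.000 [get_ports dout]

## Timing Exceptions Section
# False Paths
set_false_path -from [get_ports rst_n]
# Max Delay / Min Delay
# Multicycle Paths
# Case Analysis
# Disable Timing

## Physical Constraints Section
set_property IOSTANDARD LVCMOS33 [get_ports sys_clk]
```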

 

FPGA TIMING


Start with no timing constraints

Use IDE to view the clock networks

Create basic clocks

Review Timing reports for coverage

Open the synthesized design and view the schematic

Report clock networks and click Constraints

See the unconstrained clocks

Before editing, run report_clocks

Edit timing constraints


 

Click generated clock



Description:

Create a generated clock object

 

Syntax:

create_generated_clock  [-name <arg>] [-source <args>] [-edges <args>]

                        [-divide_by <arg>] [-multiply_by <arg>]

                        [-combinational] [-duty_cycle <arg>] [-invert]

                        [-edge_shift <args>] [-add] [-master_clock <arg>]

                        [-quiet] [-verbose] <objects>

 

Returns:

new clock object

 

Usage:

  Name              Description

  -----------------------------

  [-name]           Generated clock name

  [-source]         Master clock source object pin/port

  [-edges]          Edge Specification

  [-divide_by]      Period division factor: Value >= 1

                    Default: 1

  [-multiply_by]    Period multiplication factor: Value >= 1

                    Default: 1

  [-combinational]  Create a divide_by 1 clock through combinational logic

  [-duty_cycle]     Duty cycle for period multiplication: Range: 0.0 to 100.0

                    Default: 0.0

  [-invert]         Invert the signal

  [-edge_shift]     Edge shift specification

  [-add]            Add to the existing clock in source_objects

  [-master_clock]   Use this clock if multiple clocks present at master pin

  [-quiet]          Ignore command errors

  [-verbose]        Suspend message limits during command execution

  <objects>         List of clock source ports, pins, or nets

 

Categories:

SDC, XDC

report_timing

if its ports go generic

if it is pins, use the Tcl command get_pins

and run report_timing -from [get_pins ..] -to [get_pins ..]

view contents of the report

Timing reports always start from the primary clock, propagate to the generated clocks, and then continue on to the clocked elements.

Observe the destination clock path timing.

See the requirement of the generated clock.

See the destination clock path, which starts at the next clock edge and continues on to the destination register.

Slack = required time - arrival time

Open the schematic of the netlist, select clk_gen.window and zoom in on it.

Observe the logic in the schematic that is periodically enabled to generate the destination clocks.

In this case, the generated clocks do not have a predefined relationship with the primary clock at the clk pin.

As a result, create the constraint with the Tcl command

create_generated_clock

save the constraints

enter the tcl command report_clocks

observe the new generated clock included in the timing reports

click report timing summary

select new generated clk

See the intra-clock paths and the setup paths. Double-click any path to view the path properties.

See the source clock path.

The source clock delay starts at the primary clock and propagates to the generated clocks, both automatically derived and manually created.

See the destination clock path.

It also starts from the primary clock and propagates to the generated clocks.

close the properties.

 

set input_delay

report_timing -from [all_inputs]

see the reports

set_input_delay -clock [get_clocks phyclk0] -max 3 $my_input

report_timing -from $my_input -setup

see the reports

see the slack and input delay

insert the Tcl below

set_input_delay -clock [get_clocks phyclk0] -min 1 $my_input

report_timing -from $my_input -hold

the report shows the actual slack and the input delay

In addition, set_input_delay can be added through Edit Timing Constraints

 

set output_delay

 

report_timing -to [all_outputs]

see the reports

set_output_delay -clock [get_clocks sysclk0] <delay> $my_output

report_timing -to $my_output -hold

see the reports

see the actual slack and output delay

In addition, set_output_delay can be added through Edit Timing Constraints

 

set clock groups

Open the synthesized design

Report clock interaction

Analyze the timing path from one clock domain to another clock domain

Report shows grid of clock interactions

Each grid cell is colored to indicate the timing and constraint status of the inter-clock paths

If paths between the two clocks do not meet timing, check whether the clocks are actually asynchronous

Here the clock frequencies are not integer multiples of each other, so it is impossible to determine a fixed phase relationship between them.

 

report_clock_interaction -delay_type min_max -significant_digits 3 -name timing_1

see the wns and choose max

Suppose you have two clocks such as clk_out1_clk_core and clk_out2_clk_core:

 

join [get_timing_paths -from [get_clocks clk_out1_clk_core] -to [get_clocks clk_out2_clk_core] -max_paths 200]

 

join [get_timing_paths -from [get_clocks clk_out2_clk_core] -to [get_clocks clk_out1_clk_core] -max_paths 200]

 

Two clocks marked as asynchronous

Launch Edit Timing Constraints

Double-click Set Clock Groups

You need to specify two clock groups

Add the first clock group

Add the second by clicking the + sign

Note that asynchronous is chosen by default.

Save the constraints

Report clock interaction

Observe that the inter-clock paths between clock groups 1 and 2 are now shown in blue, confirming that they are treated as asynchronous
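The GUI steps above save a constraint of roughly the following form. This is a sketch that assumes the two clock names used in the get_timing_paths commands above:

```tcl
# Treat the two clocks as asynchronous to each other.
# Paths between the two groups are no longer timed, so every real
# crossing between them still needs proper synchronization logic.
set_clock_groups -asynchronous -group [get_clocks clk_out1_clk_core] -group [get_clocks clk_out2_clk_core]

# Re-run the interaction report to confirm the new status.
report_clock_interaction
```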

 

Set false path





report_timing_summary

See the inter-clock paths

See the setup paths

set_false_path -from [get_pins   ] -to [get_pins ]

report_timing -from [get_pins   ] -to [get_pins ]

if the slack is infinite, the false path was applied successfully

if not, check the false path constraint

 

set multicycle path

report_timing

See the interclock path

See the setup

 




See the violated slack, if any

See the requirement
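A typical multicycle constraint for such a path could look like the sketch below. It assumes a path between two hypothetical cells, src_reg and dst_reg, in the same clock domain, where the destination only captures data every second cycle (for example through a clock enable):

```tcl
# Allow two clock cycles for the setup check on the slow path.
set_multicycle_path 2 -setup -from [get_cells src_reg] -to [get_cells dst_reg]

# Moving the setup edge also moves the hold check; pull it back by
# N-1 = 1 cycle so that hold is still checked at the launch edge.
set_multicycle_path 1 -hold -from [get_cells src_reg] -to [get_cells dst_reg]

# Re-check the path: the setup requirement should now be two clock periods.
report_timing -from [get_cells src_reg] -to [get_cells dst_reg] -setup
```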


      

What are the guidelines followed for good floorplan?

Steps followed with the aim of generating an optimum floor plan.

1.      Defining the core area using Specify Floor Plan Form

2.      Defining ports specified by top level engineer.

3.      Placing the macros inside the core area.

4.      Placing the macros that communicate with each other together, with the help of Fly Lines, Color by Hierarchy and Data Flow Diagrams.

5.      Avoid placing macros in front of ports.

6.      Arrange the macros to get a contiguous core area.

7.      Defining halos.

8.      Defining placement and routing blockages.

Place & Route phase

During the chip level P&R phase, following information shall be collected from synthesis stage along with the netlists:


1.      Area constraints with unrelated logic compression information


2.      Timing constraints including false and multicycle paths


3.      IO assignments


4.      Timing critical sub-module information

As a first step in the P&R process, fix the locations of top level primitives such as global clock buffers, DLL/PLL and IOs. After the first P&R pass on a complex design, the route delays are usually high and there are many violating paths, even after module level floorplanning.

At this stage it is better to look for area regions where the route delays are higher rather than trying to solve timing issues one at a time.

The top one or two timing violator areas shall be found, and unrelated logic compression shall be reduced and/or the area of such regions shall be increased. The area left spare during the high-level floorplan will be helpful here. This process shall be continued until only a few timing violators remain.

At this stage it might be better to construct relatively-placed macros (RPM) for hard-to-meet timing paths, such as huge multiplexers, FIFOs, etc.

The core generator tools (such as Coregen from Xilinx) can generate components with RPM. It is also possible to generate custom RPM structures based on the floorplan by instantiating FPGA primitives. These macros will be placed as a group of elements in the defined area region.

This RPM structure method is especially useful for short timing critical paths. After performing all these steps, there may still be a last few timing violators.

One way of handling these violators is to open the placed and routed design in FPGA editing tools and fix them by moving elements around. This is often a tedious manual process, which might even impact other paths. It would be of great help if P&R tools could automate this process, even under manual guidance, by doing what-if analysis.

 Clock related issues:

 Distribution

 Generation (frequency synthesis)

Deskew

Multiplexing

Run-time programming

Domain crossing

 Clock Distribution

 Device split into halves (top/bottom) and clock regions (rectangular, uniform size)

 Different clock buffer types:

  BUFG can clock any FF within the top or bottom of the device (top/bottom BUFGs)

  BUFR clocks a single clock region

  BUFMR clocks up to 3 clock regions

  BUFH clocks a horizontal clock spine within a region

 FPGA CLOCKING

 Clock generation (frequency synthesis) 

Uses “Clock Management Tiles” which consist of:

• PLL/DCM (Frequency Synthesis)

• MMCM (Advanced PLL with phase control)

Clock input from PCB must use “Clock capable pins” of FPGA

• Differential pairs

• Single-ended clocks always use P side

 FREQUENCY SYNTHESIS

 Common use case: generate all design clocks from single input clock (crystal oscillator)

 CLOCK MUX

 Many applications require clock multiplexing:

 In-circuit debugging (to avoid domain crossing)

Dynamic frequency scaling

Clock redundancy (switch away from dead clock)

 FPGA clock multiplexers (2:1) implemented with BUFGMUX_CTRL primitive

Clock switch latency: max 3 clock cycles of the slower clock

 Glitchless output

 CLOCK RELATED CONSTRAINTS

 Timing constraints:

Period: guides timing analysis with regard to a periodic signal (clock)

 NET "" TNM_NET = "";

 TIMESPEC = PERIOD "" HIGH/LOW %;

OFFSET IN/OUT

FROM:TO

 CLOCK DOMAIN CROSSING

 FPGA flip-flops recover from metastability quickly on average (on the order of ps), but the recovery time is unbounded: there is a nonzero probability that the metastable state will last longer than any given time T.

  Design goal: achieve a desired MTBF, given the recovery parameters of the flip-flop.

 Synchronizer: Multiple FFs at the receiving end guard against metastability

  The number of FFs is a function of the desired MTBF and the ratio of clock frequencies. Register the output from the source domain for more predictable timing (increased MTBF).

 2FF synchronizers work well when the destination clock is faster than the source clock (the signal will remain stable for at least one destination clock cycle).  Fast-to-slow crossing requires a closed-loop synchronizer with handshake

 Closed-loop: Control signal crosses into clock domain 2, then back into clock domain 1, and is checked against the reference (high latency)
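On the constraint side, a 2FF synchronizer is commonly described to Vivado along the lines of the sketch below. The register names sync_ff1_reg and sync_ff2_reg are hypothetical and stand for the two synchronizer flip-flops in the destination clock domain:

```tcl
# Mark both synchronizer flip-flops so the tools treat them as a
# synchronizer chain and keep them close together.
set_property ASYNC_REG TRUE [get_cells {sync_ff1_reg sync_ff2_reg}]

# The crossing into the first flip-flop is asynchronous by design,
# so exclude it from timing analysis.
set_false_path -to [get_pins sync_ff1_reg/D]
```

This only covers the timing constraints; the number of stages still has to be chosen from the desired MTBF as described above.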

  


                                        GTX TRANSCEIVER

 

The transceiver is referred to as GTX (Gigabit Transceiver), but other variants of transceivers, for instance GTH and GTZ, are to a large extent the same components with different bandwidth capabilities.

 

The 7 series FPGAs GTX and GTH transceivers are power-efficient transceivers, supporting line rates from 500 Mb/s to 12.5 Gb/s for GTX transceivers and 13.1 Gb/s for GTH transceivers.

 

The GTX/GTH transceiver is highly configurable and tightly integrated with the programmable logic resources of the FPGA.

 

The GTX/GTH transceiver supports these use modes:

 

• PCI Express, Revision 1.1/2.0/3.0

• 10GBASE-R

• Interlaken

• 10 Gb Attachment Unit Interface (XAUI), Reduced Pin eXtended Attachment Unit Interface (RXAUI), 100 Gb Attachment Unit Interface (CAUI), 40 Gb Attachment Unit Interface (XLAUI)

• Common Public Radio Interface (CPRI™)/Open Base Station Architecture Initiative (OBSAI)

• OC-48/192

• OTU-1, OTU-2, OTU-3, OTU-4

• Serial RapidIO (SRIO)

• Serial Advanced Technology Attachment (SATA)/Serial Attached SCSI (SAS)

• Serial Digital Interface (SDI)

• SFF-8431 (SFP+)                         

 

Overview

GTXs, which are the basic building block for common interface protocols (e.g. PCIe and SATA), are becoming an increasingly popular solution for communication between FPGAs.

As a GTX instance consists of a clock and a parallel data interface, it’s easy to mistake it for a simple channel that moves the data to the other end in a failsafe manner.

A more realistic view of the GTX is as a front end for a modem, with possible bit errors and a need to synchronize the serial-to-parallel data alignment at the receiver.

Designing with the GTX also requires attention to classic communication-related topics, e.g. the use of data encoding, equalizers and scramblers. As a result, there are a few application-dependent pieces of logic that need to be developed to support the channel:

The possibility of bit errors on the channel must be handled

The alignment from a bit stream to a parallel word must be taken care of (which bit is the LSB of the parallel word in the serial stream?)

If the transmitter and receiver aren’t based on a common clock, a protocol that injects and tolerates idle periods on the data stream must be used, or the clock difference will cause data underflows or overflows. Sending the data in packets is a common solution. In the pauses between these packets, special skip symbols must be inserted into the data stream, so that the GTX receiver’s clock correction mechanism can remove or add such symbols in the stream presented to the application logic, which runs at a clock slightly different from that of the received data stream.

Odds are that a scrambler needs to be applied on the channel. This requires logic that creates the scrambling sequence as well as synchronizes the receiver. The reason is that an equalizer assumes that the bit stream is uncorrelated on the average. Any average correlation between bit positions is considered ISI and is “fixed”.

It’s not uncommon that no bit errors are ever observed on a GTX channel, even at very high rates, and possibly with no equalization enabled. This can’t be relied on, however, as there is in fact no express guarantee for the actual error probability of the channel.

Enhanced features in Virtex-7:

·         2-byte and 4-byte internal datapath to support different line rate requirements.

·         Quad-based LC tank PLL (QPLL) for best jitter performance and channel-based ring oscillator PLL.

 

·         Power-efficient, adaptive linear equalizer mode called the low-power mode (LPM) and a high-performance, adaptive decision feedback equalization (DFE) mode to compensate for high frequency losses in the channel while providing maximum flexibility.

·         RX margin analysis feature to provide non-destructive, 2-D post-equalization eye scan.

 

Clocking

The clocking of the GTXs is an issue in itself. Unlike the logic fabric, each GTX has a limited number of possible sources for its reference clock.

It’s mandatory to ensure that the reference clock(s) are present on one of the allowed dedicated inputs.

Each clock pin can function as the reference clock of up to 12 particular GTXs.

It’s also important to pay attention to the generation of the serial data clocks for each GTX from the reference clock(s).

It’s not only a matter of what multiplication ratios are allowed, but also how to allocate PLL resources and their access to the required reference clocks.

QPLL vs. CPLL

Two types of PLLs are available for producing the serial data clock, typically running at several GHz: QPLLs and CPLLs.

The GTXs are organized in groups of four (“quads”). Each quad shares a single QPLL (Quad PLL), which is instantiated separately (as a GTXE2_COMMON). In addition, each GTX has a dedicated CPLL (Channel PLL), which can generate the serial clock for that GTX only.

Each GTX may select its clock source from either the (common) QPLL or its dedicated CPLL. The main difference between these is that the QPLL covers higher frequencies.

High-rate applications are hence forced to use the QPLL. The downside is that all GTXs sharing the same QPLL must have the same data rate (except that each GTX may divide the QPLL’s clock by a different ratio).

The CPLLs allow for greater flexibility of the clock rates, as each GTX can pick its clock independently, but with a limited frequency range.
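As a rough sketch of how a line rate follows from the reference clock, the snippet below uses the CPLL relation as I recall it from the 7 series transceiver user guide: VCO frequency = f_ref * N1 * N2 / M, and line rate = VCO frequency * 2 / D. Treat the formula, the parameter names and the divider set (1, 2, 4, 8) as assumptions to be checked against the user guide for the actual device; the helper names are hypothetical.

```python
def cpll_line_rate(f_ref_hz, n1, n2, m, d):
    """Line rate in bits/s for a channel PLL (assumed formula, see lead-in)."""
    f_vco = f_ref_hz * n1 * n2 / m   # PLL output frequency
    return f_vco * 2 / d             # two bits per PLL clock cycle, then divide by D

def lane_rates_from_shared_pll(full_rate_bps, dividers=(1, 2, 4, 8)):
    """Rates that lanes sharing one QPLL could run at, one per output divider."""
    return [full_rate_bps / d for d in dividers]

# 125 MHz reference, N1=5, N2=4, M=1 gives a 2.5 GHz VCO.
print(cpll_line_rate(125e6, 5, 4, 1, 1))   # full rate
print(cpll_line_rate(125e6, 5, 4, 1, 2))   # same PLL setting, output divided by 2
print(lane_rates_from_shared_pll(10e9))
```

The second helper illustrates the QPLL restriction described above: lanes in one quad may differ only by the output divider.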

Jitter

Jitter on the reference clock(s) is the silent killer of GTX links. Designers often neglect it because “it works anyhow”, but jitter on the reference clock has a disastrous effect on the channel’s quality, which can be far worse than the effect of a poor PCB layout.

As both jitter and poor PCB layout (and/or cabling) contribute to the bit error rate and the channel’s instability, the PCB design is often blamed when things go bad. This leads people to believe that GTX links are extremely sensitive to every via or curve in the PCB trace, which is not the case at all. They are, on the other hand, very sensitive to the reference clock’s jitter. A poorly chosen reference clock can be compensated for with a very clean PCB trace.

Jitter is commonly modeled as a noise component which is added to the timing of the clock transition, i.e. t=kT+n (n is the noise). Consequently, it is often defined in terms of the RMS of this noise component, or a maximal value which is crossed at a sufficiently low probability.

The treatment of a GTX reference clock requires a slightly different approach; RMS figures are not necessarily a relevant measure. In particular, clock sources with excellent RMS jitter may turn out inadequate, while other sources, with less impressive RMS figures, may work better.

Since the QPLL or CPLL locks on this reference clock, jitter on the reference clock results in jitter in the serial data clock. The prevailing effect is on the transmitter, which relies on this serial data clock; the receiver is mainly based on the clock it recovers from the incoming data stream, and is therefore less sensitive to jitter.

Some of the jitter, in particular “slow” jitter (based upon low-frequency components), is fairly harmless, as the other side’s receiver clock synchronization loop will cancel its effect by tracking the random shifts of the clock.

On the other hand, very fast jitter in the reference clock may not be picked up by the QPLL/CPLL, and is hence harmless as well.

All in all, there’s a certain band of frequency components in the clock’s timing noise spectrum, which remains relevant:

The band that causes jitter components which are slow enough for the QPLL/CPLL to track and hence present on the serial data clock, and too fast for the receiver’s tracking loop to follow.

The measurable expression for this selective jitter requirement is given in terms of phase noise frequency masks, or sometimes as the RMS jitter in bandwidth segments. The spectral behavior of clock sources is often more difficult to predict: Even when noise spectra are published in datasheets, they are commonly given only for certain scenarios as typical figures.
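The band of relevant jitter described above can be illustrated with a toy model: treat the transmitter's PLL as a first-order low-pass filter on reference clock jitter and the receiver's tracking loop as a first-order high-pass filter. Both filter shapes and both bandwidths (PLL_BW, CDR_BW) are assumptions picked for illustration; real loops are of higher order and their bandwidths depend on the configuration.

```python
import math

PLL_BW = 10e6   # hypothetical TX PLL tracking bandwidth, Hz
CDR_BW = 1e6    # hypothetical RX clock recovery tracking bandwidth, Hz

def passed_by_pll(f_hz):
    """Fraction of jitter at f_hz that the TX PLL follows (first-order low-pass)."""
    return 1.0 / math.sqrt(1.0 + (f_hz / PLL_BW) ** 2)

def missed_by_cdr(f_hz):
    """Fraction of jitter at f_hz that the RX loop fails to track (first-order high-pass)."""
    x = f_hz / CDR_BW
    return x / math.sqrt(1.0 + x * x)

def harmful_fraction(f_hz):
    """Jitter that reaches the serial clock and is not tracked out by the receiver."""
    return passed_by_pll(f_hz) * missed_by_cdr(f_hz)

for f in (10e3, 3e6, 1e9):
    print(f"{f:>12.0f} Hz: {harmful_fraction(f):.3f}")
```

Only the middle frequency survives both filters, which is why a phase noise mask says more about a clock source than a single RMS jitter figure.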

8b/10b encoding

Several standardized uses of MGT channels (SATA, PCIe, DisplayPort etc.) involve a specific encoding scheme between payload bytes for transmission and the actual bit sequence on the channel.

Each (8-bit) byte is mapped to a 10-bit word, based upon a rather peculiar encoding table.

The purpose of this encoding is to ensure a balance between the number of 0s and 1s on the physical channel, allowing AC coupling of the electrical signal. The encoding also ensures frequent toggling between 0s and 1s, which ensures proper bit synchronization at the receiver by virtue of the clock recovery loop (“CDR”).

As there are 1024 possible code words covering 256 possible input bytes, some of the excess code words are allocated as control characters. In particular, a control character designated K.28.5 is often referred to as “comma”, and is used for synchronization.

The 8b/10b encoding is not an error correction code despite its redundancy, but it does detect some errors, if the received code word is not decodable. On the other hand, a single bit error may lead to a completely different decoded word, without any indication that an error occurred.

Scrambling

In short: if an equalizer is applied, the user-supplied data stream must be random.

If the data payload can’t be ensured to be random itself (this is almost always the case), a scrambler must be defined in the communication protocol, and applied in the logic design.

Applying a scrambler on the channel is a tedious task, as it requires a synchronization mechanism between the transmitter and receiver.

It’s often quite tempting to skip it, as the channel will work quite well even in the absence of a scrambler, even where it’s needed. In the long run, however, occasional channel errors are typically experienced.

The problem equalizers solve is the filtering effect of the electrical media (the “channel”) through which the bit stream travels.

Both cables and PCBs reduce the strength of the signal, but even worse: The attenuation depends on the frequency, and reflections occur along the metal trace. As a result, the signal doesn’t just get smaller in magnitude, but it’s also smeared over time.

A perfect, sharp, step-like transition from -1200 mV to +1200 mV at the transmitter’s pins may end up as a slow and round rise from -100 mV to +100 mV. Because of this slow motion of the transitions at the receiver, the clear boundaries between the bits are broken.

Each transmitted bit keeps leaving its traces way after its time period. This is called Inter-Symbol Interference (ISI): The received voltage at the sampling time for the bit at t=0 depends on the bits at t=-T, t=-2T and so on.

Each bit effectively produces noise for the bits coming after it. This is where the equalizer comes in. The input of this machine is the time sample of the bit at t=0, but also a number of measured voltage samples of the bits before and after it.

By making a weighted sum of these inputs, the equalizer manages, to a large extent, to cancel the Inter-Symbol Interference. In a way, it implements a reverse filter of the channel.

There are different techniques for training an equalizer to work effectively against the channel’s filtering. For example, cellular phones do their training based upon a sequence of bits on each burst, which is known in advance. But when the data stream runs continuously, and the channel may change slightly over time (e.g. a cable is being bent) the training has to be continuous as well. The chosen method for the equalizers in GTXs is therefore continuous.

 

The Decision Feedback Equalizer, for example, starts with making a decision on whether each input bit is a ’0’ or ’1’. It then calculates the noise signal for this bit by subtracting the expected voltage for a ’0’ or ’1’, whichever was decided upon, from the measured voltage.

The algorithm then slightly alters the weights of the sum in a way that removes any statistical correlation between the noise and the previous samples.

This works well when the bit sequence is completely random: There is no expected correlation between input samples, and if such exists, it’s rightfully removed. Also, the adaptation converges into a compromise that works best on average for all bit sequences.
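The decide, measure the error, nudge the weights loop described above can be sketched with a one-tap toy model. This is not the GTX's actual DFE: the channel (each symbol leaks 0.5 of its amplitude into the next one), the step size mu and the function name run_dfe are all assumptions made for illustration.

```python
import random

def run_dfe(symbols, channel_isi=0.5, mu=0.05):
    """One-tap decision feedback equalizer with LMS-style adaptation.

    symbols: transmitted +1/-1 values. Returns (decisions, final feedback weight).
    """
    weight = 0.0
    prev_tx = 0          # previous transmitted symbol (the channel's memory)
    prev_decision = 0    # previous decision, fed back into the equalizer
    decisions = []
    for tx in symbols:
        received = tx + channel_isi * prev_tx          # channel adds ISI
        equalized = received - weight * prev_decision  # cancel the estimated ISI
        decision = 1 if equalized >= 0 else -1
        error = equalized - decision                   # residual "noise" for this bit
        weight += mu * error * prev_decision           # remove correlation with the past
        decisions.append(decision)
        prev_tx, prev_decision = tx, decision
    return decisions, weight

rng = random.Random(1)
bits = [rng.choice((-1, 1)) for _ in range(2000)]
decisions, weight = run_dfe(bits)
print(f"learned feedback weight: {weight:.3f}")
```

With random input the weight settles at the channel's ISI coefficient, which is the "reverse filter" behavior the text describes.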

But what happens if there is a certain statistical correlation between the bits in the data itself?

The equalizer will specialize in reducing the ISI for the bit patterns that occur more often, possibly performing very poorly on the less frequent patterns.

The equalizer’s role is to compensate for the channel’s filtering effect, but instead, it adds an element of filtering of its own, based upon the common bit patterns. In particular, note that if a constant pattern runs through the channel when there’s no data for transmission (zeros, idle packets etc.), the equalizer will specialize in getting that no-data through, and mess up the actual data.

One could be led to think that the 8b/10b encoding plays a role in this context, but it doesn’t. Even though it cancels out DC on the channel, it does nothing about the correlation between the bits. For example, if the payload for transmission consists of zeros only, the encoded words on the channel will be either 1001110100 or 0110001011. The DC on the channel will remain zero, but the statistical correlation between the bits is far from being zero. So unless the data is inherently random (e.g. an encrypted stream), using an equalizer means that the data which is supplied by the application to the transmitter must be randomized.
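The all-zeros example can be checked numerically. The sketch below maps the repeated code word 1001110100 to +1/-1 levels and compares its DC level and its correlation at a lag of 10 bits with those of a random stream. The helper names are mine, and the random stream merely stands in for scrambled data.

```python
import random

def to_levels(bitstring):
    """Map '1' to +1 and '0' to -1."""
    return [1 if b == "1" else -1 for b in bitstring]

def mean(values):
    return sum(values) / len(values)

def autocorrelation(levels, lag):
    """Average product of each sample with the sample `lag` positions later."""
    products = [a * b for a, b in zip(levels, levels[lag:])]
    return sum(products) / len(products)

idle = to_levels("1001110100" * 200)      # zero payload after 8b/10b, one disparity
rng = random.Random(0)
noise = [rng.choice((-1, 1)) for _ in range(len(idle))]

print("idle  DC:", mean(idle), " lag-10 correlation:", autocorrelation(idle, 10))
print("noise lag-10 correlation:", round(autocorrelation(noise, 10), 3))
```

The idle pattern has no DC component, yet every bit is perfectly predictable from the bit ten positions earlier, which is exactly the kind of correlation an adaptive equalizer latches onto.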

The common solution is a scrambler:

XORing the payload data with a pseudo-random sequence of bits, generated by a simple state machine. The receiver must XOR the incoming data with the same sequence in order to retrieve the payload data. The comma (K.28.5) symbol is often used to synchronize both state machines.

In GTX applications, the (by far) most commonly used scrambler is the G(X)=X^16+X^5+X^4+X^3+1 LFSR, which is defined in a friendly manner in the PCIe standard.
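A minimal sketch of such a scrambler, using the polynomial above in a Galois-style 16-bit LFSR. The seed, the bit ordering and the function names are assumptions; this is not claimed to be bit-compatible with the PCIe scrambler, whose exact ordering and reset rules should be taken from the standard.

```python
LFSR_TAPS = 0x0039   # x^5 + x^4 + x^3 + 1; the x^16 term is the bit shifted out
SEED = 0xFFFF        # assumed seed; transmitter and receiver must agree on it

def lfsr_bits(state):
    """Yield an endless pseudo-random bit sequence from a 16-bit Galois LFSR."""
    while True:
        out = (state >> 15) & 1
        state = (state << 1) & 0xFFFF
        if out:
            state ^= LFSR_TAPS
        yield out

def scramble(data, seed=SEED):
    """XOR each byte with 8 fresh LFSR bits. Applying it twice restores the data."""
    bits = lfsr_bits(seed)
    out = bytearray()
    for byte in data:
        key = 0
        for _ in range(8):
            key = (key << 1) | next(bits)
        out.append(byte ^ key)
    return bytes(out)

idle_payload = bytes(16)                  # all zeros, the worst case for an equalizer
on_the_wire = scramble(idle_payload)
print(on_the_wire.hex())
print(scramble(on_the_wire) == idle_payload)
```

Restarting both LFSRs from the seed whenever a comma is seen is one way to realize the synchronization mentioned above.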

 TX/RXUSRCLK and TX/RXUSRCLK2

Almost all signals between the FPGA logic fabric and the GTX are clocked with TXUSRCLK2 (for transmission) and RXUSRCLK2 (for reception). These clocks are supplied by the user application logic, without any special restriction, except that the frequency must match the GTX’s data rate so as to avoid overflows or underflows. A common solution for generating this clock is therefore to drive the GTX’s RX/TXOUTCLK through a BUFG.

 

The logic fabric is required to supply a second clock in each direction, TXUSRCLK and RXUSRCLK (without the “2” suffix). These two clocks are the parallel data clocks in a deeper position of the GTX.

The rationale is that sometimes, it’s desired to let the logic fabric work with a word width which is twice as wide as the actual word width. For example, in a high-end data rate application, the GTX’s word width may be set to 40 bits with 8b/10b, so the logic fabric would interface with the GTX through a 32-bit data vector. But because of the high rate, the clock frequency may still be too high for the logic fabric, in which case the GTX allows halving the clock and applying the data through an 80-bit word. In this case, the logic fabric supplies the 80-bit word clocked with TXUSRCLK2, and is also required to supply a second clock, TXUSRCLK, having twice the frequency and being phase aligned with TXUSRCLK2. TXUSRCLK is for the GTX’s internal use.

A similar arrangement applies for reception. Unless the required data clock rate is too high for the logic fabric (which is usually not the case), this dual-clock arrangement is best avoided, as it requires an MMCM or PLL to generate two phase-aligned clocks. Except for the lower clock applied to the logic fabric, there is no other reason for this.
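The clock frequencies in this example follow from simple division. The helper below is a hypothetical illustration, assuming that TXUSRCLK equals the line rate divided by the internal word width and that TXUSRCLK2 is half of that when the fabric-side word is doubled.

```python
def user_clocks_hz(line_rate_bps, internal_width_bits, double_width_fabric):
    """Return (TXUSRCLK, TXUSRCLK2) in Hz under the assumptions in the lead-in."""
    txusrclk = line_rate_bps / internal_width_bits
    txusrclk2 = txusrclk / 2 if double_width_fabric else txusrclk
    return txusrclk, txusrclk2

# 10 Gb/s lane with a 40-bit internal word
print(user_clocks_hz(10e9, 40, double_width_fabric=False))  # single clock for both
print(user_clocks_hz(10e9, 40, double_width_fabric=True))   # fabric clock halved
```

In the first case one clock can drive both inputs, which is the simpler arrangement recommended above.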

 

Word alignment

 

On the transmitting side, the GTX receives a vector of bits, which forms a word for transmission. The width of this word is one of the parameters that are set when the GTX is instantiated, and so is whether 8b/10b encoding is applied. Either way, some format of parallel words is transmitted over the channel in a serialized manner, bit after bit. Unless explicitly required, there is nothing in this serial bitstream to indicate the words’ boundaries. Hence the receiver has no way, a priori, to recover the word alignment.

The receiver’s GTX output consists of a parallel vector of bits, typically with the same width as the transmitter’s. Unless a mechanism is employed by the user logic, the GTX has no way to recover the correct alignment. Without such alignment, the parallel words arrive wrongly organized at the receiver, and possibly as complete garbage, as an incorrect alignment prevents 8b/10b decoding (if employed). It’s up to the application logic to implement a mechanism for synchronizing the receiver’s word alignment.

There are two methodologies for this: Moving the alignment one bit at a time at the receiver’s side (“bit slipping”) until the data arrives properly, or transmitting a predefined pattern (a “comma”) periodically, and synchronizing the receiver when this pattern is detected.

Bit slipping is the less recommended practice, even though it is simpler to understand. It keeps most of the responsibility in the application logic’s domain. The application logic monitors the arriving data, and issues a bit slip request when it has gathered enough errors to conclude that the alignment is out of sync.

 

However, most well-established GTX-based protocols use commas for alignment. This method is easier in that the GTX aligns the word automatically when a comma is detected (if the GTX is configured to do so). If injecting comma characters periodically into the data stream fits well in the protocol, this is probably the preferred solution. The comma character can also be used to synchronize other mechanisms, in particular the scrambler (if employed).

Comma detection may also have false positives, resulting from errors in the raw data channel. As these data channels usually have a very low bit error rate (BER), this possibility can be overlooked in applications where a short-term false alignment resulting from a falsely detected comma is acceptable. When this is not acceptable, the application logic should monitor the incoming data, and disable the GTX automatic comma alignment through the rxpcommaalignen and/or rxmcommaalignen inputs of the GTX.
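What comma alignment does can be mimicked in software: search the raw bit stream for the K.28.5 code word and take its position as the word boundary. The two 10-bit patterns below are the standard K.28.5 encodings for negative and positive running disparity; the function names, the string representation and the bit order within the string are assumptions of this sketch.

```python
K28_5_CODES = ("0011111010", "1100000101")   # K.28.5 for RD- and RD+
WORD_BITS = 10

def find_word_offset(bitstream):
    """Return the word boundary offset (0..9), or None if no comma is present."""
    hits = [bitstream.find(code) for code in K28_5_CODES]
    hits = [h for h in hits if h >= 0]
    if not hits:
        return None
    return min(hits) % WORD_BITS

def split_words(bitstream, offset):
    """Cut the stream into whole 10-bit words, starting at the given offset."""
    aligned = bitstream[offset:]
    last_start = len(aligned) - WORD_BITS
    return [aligned[i:i + WORD_BITS] for i in range(0, last_start + 1, WORD_BITS)]

# Three stray bits, then a comma, then three data words
stream = "101" + "0011111010" + "1001110100" * 3
offset = find_word_offset(stream)
print(offset, split_words(stream, offset))
```

A bit error that happens to create one of these patterns at the wrong position would shift the offset, which is the false positive scenario discussed above.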

Tx buffer, to use or not to use

The Tx buffer is a small dual-clock (“asynchronous”) FIFO in the transmitter’s data path, plus some logic that makes sure that it starts off half full.

The underlying problem, which the Tx buffer potentially solves, is that the serializer inside the GTX runs on a certain clock (XCLK) while the application logic is exposed to another clock (TXUSRCLK). The frequencies of these clocks must be exactly the same to prevent overflow or underflow inside the GTX. This is fairly simple to achieve. Ensuring a proper timing relationship between these two clocks is however less trivial.

There are hence two possibilities:

Not requiring a timing relationship between these clocks (just the same frequency). Instead, use a dual-clock FIFO, which interfaces between these two clock domains. This small FIFO is referred to as the “Tx buffer”. Since it’s part of the GTX’s internal logic, going this path doesn’t require any additional resources from the logic fabric.

Make sure that the clocks are aligned, by virtue of a state machine. This state machine is implemented in the logic fabric.

The first solution is simpler and requires fewer resources from the FPGA’s logic fabric. Its main drawback is the latency of the Tx buffer, which is typically around 30 TXUSRCLK cycles.

While this delay is usually negligible from a functional point of view, it’s not possible to predict its exact magnitude. It’s therefore not possible to use the Tx buffer on several parallel lanes of data if the protocol requires a known alignment between the data in these lanes, or when an extremely low latency is required.

The second solution requires some extra logic, but there is no significant design effort: the logic that aligns the clocks is included automatically by the IP core generator in Vivado 2014.1 and later, when “Tx/Rx buffer off” mode is chosen.

Only note that gtN_tx_fsm_reset_done_out may take a bit longer to assert after a reset (something like 1 ms on a 10 Gb/s lane).

 

Rx buffer

 

The Rx buffer (also called “Rx elastic buffer”) is also a dual-clock FIFO, which is placed in the same clock domain gap as the Tx buffer, and has the same function. Bypassing it requires the same kind of alignment mechanism in the logic fabric.

 

As with its Tx counterpart, bypassing the Rx buffer makes the latency short and deterministic. It’s however less common that such a bypass is practically justified:

While a deterministic Tx latency may be required to ensure data alignment between parallel lanes in order to meet certain standard protocol requirements, there are almost always fairly easy methods to compensate for the unknown latency in user logic. Either way, it’s preferred not to rely on the transmitter to meet requirements on data alignment, and to align the data, if required, by virtue of user logic.

sysclk_in must be stable when the FPGA wakes up from configuration. A state machine that brings up the transceivers is based upon this clock. It’s referred to as the DRP clock in the wizard.

It’s important to declare the DRP clock’s frequency correctly, as certain required delays which are measured in nanoseconds are implemented by dwelling for a number of clocks, which is calculated from this frequency.

In order to transmit a comma, set the txcharisk to 1 (since it’s a vector, it sets the LSB) and the value of the 8 LSBs of the data to 0xBC, which is the code for K.28.5.
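The value 0xBC follows from the 8b/10b naming convention, in which the character x.y has x in the five low bits of the byte and y in the three high bits. The sketch below derives it and builds a hypothetical 32-bit transmit word with the comma in the lowest byte; the 4-byte width and the helper names are assumptions, not part of the GTX interface definition.

```python
def code_byte(x, y):
    """Byte value of the 8b/10b character named D.x.y or K.x.y."""
    if not (0 <= x < 32 and 0 <= y < 8):
        raise ValueError("x must be 0..31 and y must be 0..7")
    return (y << 5) | x

def word_with_comma(upper_payload):
    """Return (txdata, txcharisk) for a 32-bit word whose lowest byte is K.28.5.

    upper_payload: the 24 bits of ordinary data placed above the comma.
    """
    txdata = ((upper_payload & 0xFFFFFF) << 8) | code_byte(28, 5)
    txcharisk = 0b0001    # only the lowest byte is a control character
    return txdata, txcharisk

print(hex(code_byte(28, 5)))
data, charisk = word_with_comma(0x112233)
print(hex(data), bin(charisk))
```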

 

1. How does AES encryption work?

2. What is your contribution to the encryption algorithm? How did you modify it and test it?

3. Explain the tree of the encryption algorithm.

4. How does the Self manuring system work?

5. What is your contribution?

6. How does a DAC work?

7. Which FPGA have you worked on?

8. Is it SRAM-based or fuse-based?

9. Can you describe the components in one cell of an Artix-7 FPGA?

10. Xilinx tool flow (ISE and Vivado)?

11. How do you debug the board (techniques)?

12. How does the ILA work?

13. What are the various types of constraints?

14. Write a counter program in VHDL.

15. How do you debug a high-intensity I/O board?

1. If a decoder has 4 inputs, how many output lines does it have?

Applications of a decoder.

2. FPGA design flow in detail.

3. Signal vs. variable

4. FSM types: the difference between the two, with block diagrams

5. Number of inputs in a 4-bit parallel adder

6. Full adder

7. Buffer using XNOR

8. Setup and hold time.

9. If the number 101 is left-shifted 2 times, what is the value of the new number in the decimal system? Ans. 20 (10100)

10. Difference between block RAM and distributed RAM

11. Difference between flip-flops and latches

12. Libraries used in VHDL. Why is the ieee.std_logic_1164 package used?

13. UART protocol
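The arithmetic behind questions 1, 5 and 9 of this list can be checked in a few lines. The function names are mine, and counting the carry-in as an input of the parallel adder is an assumption about what the question intends.

```python
def decoder_outputs(n_inputs):
    """A full n-to-2^n decoder has one output line per input combination."""
    return 2 ** n_inputs

def parallel_adder_inputs(n_bits, with_carry_in=True):
    """Two n-bit operands, plus an optional carry-in."""
    return 2 * n_bits + (1 if with_carry_in else 0)

print(decoder_outputs(4))            # output lines of a 4-input decoder
print(parallel_adder_inputs(4))      # inputs of a 4-bit parallel adder
print(0b101 << 2, bin(0b101 << 2))   # 101 shifted left twice
```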

1) What are the different kinds of memory available in an FPGA?

A: Distributed RAM, block RAM and, optionally, UltraRAM

2) Name some resources generated in a netlist

A: LUTs, flip-flops, DSPs, BRAMs, LUT RAMs, URAM, I/O, etc.

3) Name some debugging tools you have used in FPGA

A: ILA, ChipScope, VIO

 4)      Name some Verilog coding constructs that are not synthesisable

A: Time delay, initial, fork/join, force and release

5) How do you fix setup time violations?

A: Add registers (pipelining), reduce the clock frequency, use stronger cells to drive, adjust the skew of the violating path, etc.

6) How do you fix hold time violations?

A: Insert delay cells, reduce the drive strength of the cells, skew the clock, etc.

7) How does an IF statement in VHDL/Verilog code manifest in real hardware?

A: MUX/ LUT

 8)      How would you design Clock domain crossing  (CDC)

A: Using FIFO, Synchronisers

9) What happens when you don’t add a default statement to a “case” construct?

A: A latch is inferred (in combinational logic, when not every case value is covered)

10) Name a few signals in AXI-Stream

A: TVALID, TDATA, TREADY, TKEEP, TLAST, TID, etc.

11) Difference between AXI4 / AXI-Lite and AXI-Stream

A: AXI4 and AXI-Lite are memory-mapped; AXI-Stream carries no addresses and does not access memory locations

12) Maximum burst length supported by AXI4 memory-mapped?

A: 256 beats

 13)   Name clocking resources in an FPGA?

A: MMCM , PLL

14) What should the value of worst negative slack be for a design to pass timing constraints?

A: Greater than or equal to zero

15) When a slave does not assert TREADY in an AXI-Stream transfer, what is the expected behaviour of the master?

A: It should hold the data stable and keep TVALID high until TREADY is asserted

 16)    What memory does the ILA use for capturing the packets

A: BRAM

 17)   What does clock gating do in an FPGA

A: it reduces power consumption of logic cells

 18)   Can a CLB be configured as RAM? If yes what kind of RAM?

A: yes, Distributed RAM

 19)   What are the different channels in an AXI -4 interface

 A: Write Address, Write data, read address, read data, Write response

      20)   Can a slave stream data to Master in an AXI-stream?

A: NO
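The answers to questions 5, 6 and 14 above can be tied together with the usual textbook slack equations. The sketch below uses a simple single-path model; the delay values are invented for illustration, skew is defined as capture clock arrival minus launch clock arrival, and all numbers are in nanoseconds.

```python
def setup_slack(t_clk, t_clk_to_q, t_logic_max, t_setup, skew=0.0):
    """Positive when data arrives early enough before the next capture edge."""
    return (t_clk + skew) - (t_clk_to_q + t_logic_max + t_setup)

def hold_slack(t_clk_to_q, t_logic_min, t_hold, skew=0.0):
    """Positive when data stays stable long enough after the capture edge."""
    return (t_clk_to_q + t_logic_min) - (t_hold + skew)

# A 200 MHz path that fails setup, then passes at a lower clock frequency
print(setup_slack(t_clk=5.0, t_clk_to_q=0.5, t_logic_max=4.2, t_setup=0.4))
print(setup_slack(t_clk=5.5, t_clk_to_q=0.5, t_logic_max=4.2, t_setup=0.4))

# A short path that fails hold because of clock skew, then passes with added delay
print(hold_slack(t_clk_to_q=0.5, t_logic_min=0.1, t_hold=0.3, skew=0.5))
print(hold_slack(t_clk_to_q=0.5, t_logic_min=0.4, t_hold=0.3, skew=0.5))
```

Note that the clock period does not appear in hold_slack: lowering the frequency fixes setup violations but not hold violations.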

21) How does the ILA work? What resources does it use?

22) What are the MMCM and PLL?

23) How do you reduce FPGA utilization?

24) Types of constraints?

25) What is AXI?

26) Difference between DSP and LUT?

27) What clock frequencies have you worked with?

28) What complexities did you face during FPGA design?

29) How do you resolve a hold time violation?

30) What are the MMCM and PLL?

FPGA  architecture

VHDL /verilog code.

 Interfaces (such as PCIE, DDR3..)

Embedded ARM

What do you mean by JTAG or USB?

Write an RTL block that implements 4x4 matrix multiplication. Then they asked me to optimize the critical path, and then to check whether the chip area can be reduced by using fewer multipliers.

How do you optimize your ASIC/FPGA design?

What is speed grade and how do you select an FPGA as per requirements?

What is the maximum possible speed achievable for a given device?

How do you code to reduce power in an FPGA design?

Answers

Avoid reset for FPGA

Clock Gate

Use synchronous design

Avoid over constraining

Reduce Device temperature (cooling solution)

Use clk_en and control enable for all Memory

Use LUT for smaller memory. BRAM takes more power

What is clock gating and how do you do it in an FPGA?

What is clock domain crossing?

Why might two blocks need to work on different clocks?

What problems may arise due to clock domain crossing of signals?

How do you resolve issues arising from clock domain crossing?

How do you manage multiple clocks and how do you route them?

CDC tools such as Spyglass can help with this. Asynchronous transfers must still be handled carefully in the design; later they can be declared as false paths so that the tool goes easy on them during compilation.

How do you do IO planning, and can you explain a few IO types?

How do you add debug probes for ChipScope, and what are the benefits of inserting them versus inferring them?

Answer: Inferring them along with your RTL gives you the freedom to debug whenever required. Adding them as probes requires you to sort through the netlist file and rack your brain finding the net names. Sometimes some logic needs to be permanently accessible through JTAG for status/debug.

How do you program multiple FPGA images in flash and load?

How do you control reset logic?

1.          FPGA used in my project. Its Architecture

2.          What is LUT?

3.          Difference between BRAM and Distributed RAM

4.          Difference between FIFO and RAM

5.          If a FIFO has a 30 MHz clock for writing and a 20 MHz clock for reading, is the reading operation synchronous or asynchronous?

6.          FPGA design flow

7.          Metastability

8.          What is the debugging process if the functional simulation is correct?

9.          Will the test bench for functional simulation and GLS simulation be same?

For an FPGA, a global reset is sufficient. Use an async reset for internal logic and synchronize that reset to the main clock (if the logic is in the same clock domain; otherwise synchronize it per domain). Always assert the reset asynchronously and de-assert it synchronously with the clock.

How do you do STA?

1.       Setup time and hold time, the equations for these, and how to fix them in case of violation

2.       What is skew?

3.       No. of output bits after performing 8*8

4.       Blocking and Non-blocking outputs

5.       Verilog code for 2:1 MUX

6.       Implement AND gate and XOR gate using 2:1 MUX

7.       Half Adder and Full Adder with equations. Implementing multiplier using adder

8.       FPGA Architecture

9.       CDC

10.   Realization of 3 input AND gate using 4 input LUT

What are the timing constraints that you do for a typical design having synchronous and asynchronous logic?

How does RTL logic, say a comparator or counter, convert to logic gates? Please describe.

Can you write a code in RTL for a debounce logic or 9-bit counter?

RTL coding for FPGA primitive components, what are the primitive components and what have you used.

What are the following: BUFGMUX, IBUF, OBUF?

How can we find the max clock for a given FPGA program?

What is Metastability and what are its effects?

What is setup and hold time?

What is pipelining?

Why do FPGAs have dedicated clock pins? What's so special about them?

Describe a design that you think that will best fit an FPGA.

Describe a design that you think that will best fit a Microcontroller.

What are the difficulties in asynchronous FIFO design?

Differences between FPGA and ASICs, clock and reset schemes

A signal derived in clock domain A is used in a sequential process in clock domain B. What can potentially happen if you use the signal directly in clock domain B? What is the proper way to handle the signal before using it in clock domain B?

Why can't we use a clock source directly rather than using FF divider circuits, which also need a clock source for their operation?

Internal workings of FPGAs, such as block RAM, DSP, logic cell, LUT

What is the most basic way of meeting timing, or of checking whether you are meeting it?

Tell me some of the constraints you used during your design and their purpose.

1.       Have you worked on a HAPS prototype board?

2.       Asked about my project.

3.       What types of issues have you faced in your project?

4.       How can you reduce area if all the FPGA resources are used?

5.       Asked about FPGA structure

6.       How will you debug your design if there is no error or timing violation, but your design is not working?

7.       What is the difference between a function and a task?

8.       Do you know VHDL and Verilog? Asked for 1-bit latch Verilog code.

What are the different types of FPGA programming modes? Which are you currently using? How do you change from one to another?

Can you list out some of synthesizable and non synthesizable constructs?

Can you draw general structure of fpga?

Difference between FPGA and CPLD?

MMCM Vs. DCM

What are a slice, a CLB and a LUT?

Which FPGA are you currently using, and what are some of the main reasons for choosing it?

Draw a rough diagram of how the clock is routed throughout the FPGA.

How many global buffers are there in your current FPGA, and what is their significance?

What is the frequency of operation and the equivalent gate count of your project?

Compare PLL & DLL.

1.       How can you resolve a routing issue?

1.       Implementing f =(a^b) & c using 4:1 Mux.

2.       Sine multiplier.

3.       How many output bits we will get if we multiply 8 bits with 8 bits?

4.       Highest and lowest value of n bit 2’s complement number.

5.       Verilog code for active low asynchronous set, en (high priority), reset(synchronous).

6.       Dual port RAM code.

7.       ROM, PAL.

8.       Design ROM using 6-LUT.
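Questions 3 and 4 of this list have short closed-form answers, sketched below. The function names are mine; the bit_length check is only a sanity test that the stated width is actually needed for the largest unsigned product.

```python
def product_width(a_bits, b_bits):
    """Bits needed to hold the product of an a-bit and a b-bit number."""
    return a_bits + b_bits

def twos_complement_range(n_bits):
    """(lowest, highest) value of an n-bit two's complement number."""
    return -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1

print(product_width(8, 8), (255 * 255).bit_length())
print(twos_complement_range(8))
```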

 

2.       What is the difference between a function and a task?

3.       What is the purpose of DSP 48 block?

4.       Xilinx ISE flow

1.       Blocking and non-blocking assignment output.

2.       Edge detector

3.       Setup and hold related questions, such as when violations occur, how to remove them, and whether hold depends on frequency or not.

4.       Verilog code for 4:1 Mux.

5.       RAM Verilog code where writing has the highest priority.

6.       How to flush memory to default /reset value in a single clock.

7.       In your design which type of the reset signals are used? Why?

8.       Why CDC is required? CDC techniques.

9.        What is MMCM in FPGA?

Suggest some ways to increase clock frequency?

What is LVS and why do we do it? What is the difference between LVS and DRC?

Clock management blocks in FPGAs

What are the transceiver types and speeds in Xilinx FPGAs?

Why are there hard macros in FPGAs? What are the downsides of including them?

Why is FPGA performance not usually as good as an ASIC or mask programmed gate array? (may not even be true anymore)

What are some of the basic differences between FPGA families?

Verilog 2001 vs 95

 Latches

 clock gating

 resource sharing

 operator overloading

operator balancing

power gating

 flip flop vs latches

 counters

registers

memories

 fifo

 fifo depth

 metastability and how to resolve it

overlap

 skew 

 slack

 setup

hold fix

encoder

 decoder

 throughput latency calculation

bufg

power reduction in rtl

Fsm verilog mealy moore

Clock schemes

 pattern gen

defparam

 local parameter

 parameter

Dynamic RAM, SRAM and their types; memory calculation

Shift register types; all digital basics: mux, encoder, flip-flop conversions, race, race-around, glitches, static hazards and dynamic hazards

1010 11001 overlap or non overlap

Booth multiplier verilog

Full subt adder carry

Mealy conversion to moore.

Booth encoding

For loop generate while loop forever

Encoding schemes, clocking schemes and their advantages

Save adder

Verilog rtl synthesis

Task and functions

MCP (multicycle path), false path, STA, DTA

Generate statements for multiplication

Parity error check and correction clock crossing in verilog verilog renerta web and tidbits fully

Setup and pulse width

Face detection of verilog in Xilinx

Neural network in fpga

Steps for handover protocol on FPGA using C

How can I physically convert high speed camera/image sensor to a fpga

No of frames and frame length in Spartan 6,virtex7,kintex7

Xilinx system generator hardware co-simulation compilation problem

16-bit multiplier using Xilinx

Semicustom design without FPGA Or CPLD on Xilinx

Face recognition and speech  recognition in FPGA

How do I write HDL code to compare images? For example, I need to compare two 512x512 images. If I compare them pixel by pixel by value, how do I do it?

Verilog to detect circles

Verilog for EMD algorithm

Generate a variable-frequency sinusoidal signal using Xilinx System Generator

FPGA Signal integrity for SERDES

 FPGA issues in Timing

 SPI active edge

 Version control

 Board bring up

 FPGA partitioning

 ARM Modes

 In which mode does ARM boot up? Why is this mode used?

 Interrupts

 Booting procedures

 When and where the modes can be used?

 How ARM boots?

 How are interrupts handled during boot?

 Difference between NAND and NOR flash

 How does C code run after the processor comes out of reset?

 Pin mux in FPGA

 Gated logic implementation  in FPGA

 Hard IP in FPGA worked

 Pull up RESISTORS  in FPGA

 The pull-up built into the I/O is not sufficient; you will need external pull-up resistors.

 Tools used in bitgen

 I2C pin mapping in FPGA

USB multiplexer via FPGA

Possible signal integrity problems in a 100 MHz clock

4 core processor on FPGA

Relay control in fpga

Routing through TCL

Reset FPGA by PLL

PS to PL interrupts.

Generate a SAIF file in ISim or ModelSim simulation

Simulate sdf file in modelsim

Good scaling scheme for radix FFT on FPGA

Copy contents of array in memory editor Xilinx Isim

Miller Decoder state machine

Update .mif file without compiling TCL

Generate 4 phase shifted clock signals

Implement ADC in FPGA where FPGA runs at 100MHz clock.

Interface SRAM to microcontroller using FPGA.

Clock skew problem in serdes

Interface of FPGA to DSO

Data string length to be send out from FPGA.

Uart TX signaling from another processor in FPGA

How to read back the configuration memory of SRAM based FPGA for SEU mitigation?

Read/Write to Virtex7 through USB/jtag port.

Generate vcd from .v file and add it in synthesis to perform power analysis.

configure cordic divider in  Xilinx system generator

 What are the issues if the duty cycle of the clock in a digital ckt is changed from 50%?

  Challenges faced while transfer of data between 2 different clock domains

 DFF with 1 delay: what happens?

 If the interfaces to the FPGA and the UART are the same but driven from different sources, will it have an effect?

If my receiver clock has some tolerance relative to the transmitter clock, will it have an effect?

If I2C is replaced by APB What happens?

 Analog IO pad and digital IO pad? 

What are the different ways for optimizing a timing path in the VLSI design? 

Interface pins of uart to fpga 

Interface pins of usb to fpga 

Interface pins of DDR,PCI to FPGA 

DDR3,DDR4,USB,PCI

 Constraints after synthesis or before synthesis

 Floorplan,placement,routing methods

 Script for basic clock gating

 ASIC to FPGA conversion and vice versa

 Ethernet interface to FPGA

 Ethernet frames

 TCP/IP protocol

 Axi protocol all

 AMBA protocol all

 Qualify STA of a chip

 Factors affecting delay and skew

Boot Linux on zynq fpga

 Detect whether transparent latch is inferred during the elaboration

 USB transfer types

CDC Strategies

100 Gb Ethernet

 Usb for mobile

 SRIO

 CPRI

 Images ,audio and video algorithms in FPGA

 Floating point in fpga

 ELF vs HDF

 AXI tapeout to peripheral

 USB

 PETALINUX COMPILE

 UART IN ZED BOARD

 LVDS STDS

 LVDS PORT MAP

 PCS,PMA IN GTX

 USB 3.0 validation

 Link training for PCI

 AURORA IP

 Difference between root complex and endpoint

 ARM processor bringup

 Design margin

 Construct memory using LUT

 64X1 MUX USING  LUT

 Dsp48

 Maximum frequency of the design

 Maximum frequency Fmax

 ADC interface to FPGA

 DRC in vivado

 DRP in vivado

Axi tapeout to peripheral

Video IP

Pixel size,image resolution in video

Testbench of 100Mhz with 1 Mhz clk reset 100ns

Design for high throughput

Latency at low end

Eyescan flow diagram

RJ45 pins to FPGA

In I2C, if the master issues a read request and it fails, what happens?

FSBL

Cpu interfaces with FPGA

Banking rule of FPGA

Which buffer used for violation?

If 60ps is used,how to fix violation?

ARM addressing modes

AWLEN

10 mhz and 100mhz working with single pulse

GATE COUNT in vivado

In CLB,GATE COUNT

Fault simulation

Register balancing

RAM extraction

Shift register extraction

Register duplication

Throughput and latency calculations in design

In GTX, if the transmitter frequency is higher than the receiver frequency, what happens?

Differential clock to a single-ended clock on the VC707

 Multipliers used in 64 coefficients

FIR BLOCK

Ahb Timing Diagram

AHB block

32 bit reg to AHB.RTL CODE

APB Timing diagram

Sdk environment

Metastability Fix

Metastability waveform?

Given a circuit, what is the hardware?

DDR interfaces,PCIE,USB,SERDES

Flip flop as 4 bit counter

High speed Transceiver

DLL,PLL,DPLL

ADC to FPGA

DRC

 In which stage of the design flow do we get an idea about the false paths and multicycle paths in the design?

Which tool will report them?

At RTL level, we are left with only the logic of the design, and we have no idea how the tool will synthesize different paths in the design. The tool synthesizes the design according to its algorithm. From the thousands of R2R (register-to-register) paths the tool synthesizes, how can we know which are false paths or multicycle paths?

 

  The best way to use an asynchronous reset, if one has to have it: use a BUFGCE for the clock domain, and the reset sequence must be as follows:

 1. Disable the clock

2. Assert the reset

3. Deassert the reset, let the system wait for something like 32 clocks

4. Enable the clock

 If clock gating fails?

 You want a clock gate pushed into a BUFGCE, the register that creates the gate becomes part of the BUFGCE structure.  If the two gating terms come from registers on different clocks or if they come from somewhere outside the hierarchy being examined by the synthesizer, then the tools may not be able to automatically create the gated clock.  Just taking the gate term as written and using it as the CE input of a BUFGCE would cause a gating delay that doesn't match the RTL.  It also seems that the tools are smart enough to look for other solutions like divide-by-2, but couldn't match those to the RTL, either

  Challenges with clock gating

 1.      Circuit timing changes

2.      Area penalty (additional logic is added)

3.      Equivalency checking (especially for sequential clock gating)

4.      Clock domain crossing issues

5.      Reset domain crossing issues

6.      Complicate timing closure due to additional delay at clock network.

7.      As these changes are local to power domains, I don’t think any impact can come from power states. I am not sure of any impact if the clock gating applied at an architectural level?

8.      DFT methodology changes. The test enable mux can be inserted pre OR post of latch in clock gating cell

 Bypass and in the clock gating

     It can be done, with care and a thorough understanding of the possible consequences. These include metastability when taking clocked signals into the gated clock domain and worse results from timing-driven synthesis/layout. But there are nearly always other ways to achieve the same control over a circuit as gating the clock, without all such risks and penalties.

Resource sharing

 Pipelining

 How do I avoid glitch in generating multiple clocks and mux the clocks into one interface clock based on selection lines in Verilog?

 If an input is being fed into the FPGA, define the constraints to be added.

 When setting up constraints in Vivado Design Suite flows, be sure to do the following:

·         Define all primary clocks on input ports or nets connected to input ports.

·         Define clocks on the black box output pins.

·         Define generated clocks on nets.

·         Don’t define gated clocks.

·         Provide correct clock constraints – don’t over-constrain, and be sure to place unrelated (aka asynchronous) clocks in separate clock groups.

·         Define timing exceptions such as false paths and multicycle paths.

Why is clock gating not used in FPGAs?

       Whenever a clock signal is passed through a gate, it adds skew to the clock signal. An ASIC flow has the flexibility to correct setup and hold time violations caused by such gating; they are rectified in the design layout process. FPGAs do not have that flexibility. One can calculate and predict such violations but cannot rectify them using buffers as in ASICs. Even if a buffer is introduced, its delay, location and count are completely out of our hands. So normal gating can be done in an FPGA, but tightly constrained gating where timing is very important can produce undesirable results. So it is NOT RECOMMENDED in FPGAs.

 What are the different tests you would do to verify your verilog code?

 To generate non-overlapping clock

Why do FPGAs use flash?

 In flash-based FPGAs, non-volatile memory cells hold the configuration pattern right on the chip, and even if power is removed the contents of the flash cells stay intact. Thus when the system restarts, the FPGA powers up in microseconds, saving time and allowing the system to recover quickly from a power failure or a restart.

How to build SPI Flash for FPGA?

How to fix recovery time of an asynchronous reset

 Make sure that you are using the asynchronous reset pin of the FF correctly; use a reset bridge (reset synchronizer) to synchronize the reset, and you won't have a reset recovery failure on your "real" FFs. However, you might get one on the asynchronous input to the reset bridge. Since the reset bridge is a synchronizer, the reset input to the synchronizer can be declared as a false path. The reset sequence is then:

1. Disable the clock

2. Assert the reset

3. Deassert the reset, let the system wait for something like 32 clocks

4. Enable the clock

  In an asynchronous FIFO, how do you add delay?

DMA

What is the frequency of waveform in FPGA?

frequency_step = 2^N * frequency_hz / sample_clock_rate_hz
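
The formula above can be tried out numerically. The following is only a sketch: it assumes that N is the bit width of a phase accumulator and that frequency_step is the value added to that accumulator on every sample clock, which the notes do not state explicitly. The function names and the example frequencies are made up for illustration.

```python
def frequency_step(frequency_hz, sample_clock_rate_hz, accumulator_bits):
    """Phase increment per sample clock: 2^N * frequency_hz / sample_clock_rate_hz.

    accumulator_bits plays the role of N (assumed to be the accumulator width).
    The result is rounded because the hardware increment is an integer.
    """
    return round((1 << accumulator_bits) * frequency_hz / sample_clock_rate_hz)


def output_frequency_hz(step, sample_clock_rate_hz, accumulator_bits):
    """Inverse of the formula: frequency actually produced by an integer step."""
    return step * sample_clock_rate_hz / (1 << accumulator_bits)


if __name__ == "__main__":
    # Hypothetical numbers: 1 MHz output from a 100 MHz sample clock, N = 32.
    step = frequency_step(1_000_000, 100_000_000, 32)
    print("frequency_step =", step)
    print("actual output  =", output_frequency_hz(step, 100_000_000, 32), "Hz")
```

Because the step is rounded to an integer, the frequency actually generated differs slightly from the requested one; a wider accumulator makes that error smaller.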

Vivado TCL Flow
=============

Use a makefile and Tcl to run simulation, synthesis, implementation and bitstream generation.
# Makefile
==========
sim:
    vivado -mode batch -source sim.tcl
imp:
    vivado -mode batch -source non-proj.tcl
proj:
    vivado -mode batch -source proj.tcl

------------------------------------------------------------------------------------

# sim.tcl
==========
set path [file dirname [info script]]
puts "script is invoked from $path"
source [file join $path modelsim.tcl]

------------------------------------------------------------------------------------


# modelsim.tcl
================
set file_dir [file normalize [file dirname [info script]]]

puts "== Unit Test directory: $file_dir"
#set ::env(XILINX_TCLAPP_REPO) [file normalize [file join $file_dir .. .. ..    ]]

#puts "== Application directory: $::env(XILINX_TCLAPP_REPO)"
#lappend auto_path $::env(XILINX_TCLAPP_REPO)

set name "modelsim"

create_project $name ./$name -force
add_files -fileset sources_1 "$file_dir/q1.vhd"
#add_files -fileset sources_1 "$file_dir/.vhd"
add_files -fileset sim_1 "$file_dir/tb.vhd"
update_compile_order -fileset sources_1
update_compile_order -fileset sim_1
launch_simulation -batch
close_project


------------------------------------------------------------------------------------


# non-proj.tcl
==============
# STEP#1: setup design sources and constraints
read_vhdl ./q1.vhd
#read_vhdl ./src/AND_TEST.vhd
#read_vhdl ./src/OR_GATE.vhd
# read_vhdl
# read_verilog
read_xdc ./q1.xdc
#
# STEP#2: define the output directory area.
set outputDir ./output
file mkdir $outputDir

#
# STEP#3: run synthesis, write design checkpoint, report timing,
# and utilization estimates
#
synth_design -top q1 -part xc7z020clg484-1
write_checkpoint -force $outputDir/post_synth.dcp
report_timing_summary -file $outputDir/post_synth_timing_summary.rpt
report_utilization -file $outputDir/post_synth_util.rpt
#
# Run custom script to report critical timing paths
#reportCriticalPaths $outputDir/post_synth_critpath_report.csv
#
# STEP#4: run logic optimization, placement and physical logic optimization,
# write design checkpoint, report utilization and timing estimates
#
opt_design
#reportCriticalPaths $outputDir/post_opt_critpath_report.csv
place_design
report_clock_utilization -file $outputDir/clock_util.rpt
#
# Optionally run optimization if there are timing violations after placement
if {[get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]] < 0} {
    puts "Found setup timing violations => running physical optimization"
    phys_opt_design
}
write_checkpoint -force $outputDir/post_place.dcp
report_utilization -file $outputDir/post_place_util.rpt
report_timing_summary -file $outputDir/post_place_timing_summary.rpt


#
# STEP#5: run the router, write the post-route design checkpoint, report the routing
# status, report timing, power, and DRC, and finally save the Verilog netlist.
#
route_design
write_checkpoint -force $outputDir/post_route.dcp
report_route_status -file $outputDir/post_route_status.rpt
report_timing_summary -file $outputDir/post_route_timing_summary.rpt
report_power -file $outputDir/post_route_power.rpt
report_drc -file $outputDir/post_imp_drc.rpt
write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true
#
# STEP#6: generate a bitstream
#
write_bitstream -force $outputDir/proj.bit



------------------------------------------------------------------------------------



# proj.tcl
=========
#
# STEP#1: define the output directory area.
#
set outputDir ./project
file mkdir $outputDir
create_project project_cpu_project ./project -part xc7z020clg484-1 -force

#
# STEP#2: setup design sources and constraints
#
# example:
#add_files -fileset sim_1 ./Sources/hdl/cpu_tb.v
#add_files [ glob ./Sources/hdl/bftLib/*.vhdl ]
#add_files ./Sources/hdl/bft.vhdl
#add_files [ glob ./Sources/hdl/*.v ]
#add_files [ glob ./Sources/hdl/mgt/*.v ]
#add_files [ glob ./Sources/hdl/or1200/*.v ]
#add_files [ glob ./Sources/hdl/usbf/*.v ]
#add_files [ glob ./Sources/hdl/wb_conmax/*.v ]
#add_files -fileset constrs_1 ./Sources/top_full.xdc
#set_property library bftLib [ get_files [ glob ./Sources/hdl/bftLib/*.vhdl]]
add_files -fileset sim_1 ./tb.vhd
add_files ./q1.vhd
add_files -fileset constrs_1 ./q1.xdc

#
# Physically import the source files under the project's .srcs/sources_1/imports directory
import_files -force -norecurse
#
#
# Physically import q1.xdc under the project's .srcs/constrs_1/imports directory
import_files -fileset constrs_1 -force -norecurse ./q1.xdc
# Set the top module (q1, matching -top in non-proj.tcl) before updating the compile order
set_property top q1 [current_fileset]

update_compile_order -fileset sources_1
update_compile_order -fileset sim_1
#
# STEP#3: run synthesis and the default utilization report.
#
launch_runs synth_1
wait_on_run synth_1
#
# STEP#4: run logic optimization, placement, physical logic optimization, route and
#         bitstream generation. Generates design checkpoints, utilization and timing
#         reports, plus custom reports.
set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
set_property STEPS.OPT_DESIGN.TCL.PRE [pwd]/pre_opt_design.tcl [get_runs impl_1]
set_property STEPS.OPT_DESIGN.TCL.POST [pwd]/post_opt_design.tcl [get_runs impl_1]
set_property STEPS.PLACE_DESIGN.TCL.POST [pwd]/post_place_design.tcl [get_runs impl_1]
set_property STEPS.PHYS_OPT_DESIGN.TCL.POST [pwd]/post_phys_opt_design.tcl [get_runs impl_1]
set_property STEPS.ROUTE_DESIGN.TCL.POST [pwd]/post_route_design.tcl [get_runs impl_1]
launch_runs impl_1 -to_step write_bitstream
wait_on_run impl_1
puts "Implementation done!"


------------------------------------------------------------------------------------



# pre_opt_design.tcl
############## pre_opt_design.tcl ##################
set outputDir [file dirname [info script]]/project
source [file dirname [info script]]/reportCriticalPaths.tcl
#
report_timing_summary -file $outputDir/post_synth_timing_summary.rpt
report_utilization -file $outputDir/post_synth_util.rpt
reportCriticalPaths $outputDir/post_synth_critpath_report.csv


------------------------------------------------------------------------------------

# post_opt_design.tcl
############## post_opt_design.tcl ##################
# Run custom script to report critical timing paths
reportCriticalPaths $outputDir/post_opt_critpath_report.csv


------------------------------------------------------------------------------------


# post_place_design.tcl
############## post_place_design.tcl ##################
report_clock_utilization -file $outputDir/clock_util.rpt



------------------------------------------------------------------------------------


# post_phys_opt_design.tcl
############## post_phys_opt_design.tcl ##################
report_utilization -file $outputDir/post_phys_opt_util.rpt
report_timing_summary -file $outputDir/post_phys_opt_timing_summary.rpt



------------------------------------------------------------------------------------



# post_route_design.tcl
############## post_route_design.tcl ##################
report_route_status -file $outputDir/post_route_status.rpt
report_timing_summary -file $outputDir/post_route_timing_summary.rpt
report_power -file $outputDir/post_route_power.rpt
report_drc -file $outputDir/post_imp_drc.rpt
write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true


------------------------------------------------------------------------------------


# reportCriticalPaths.tcl

#------------------------------------------------------------------------
# This function generates a CSV file that provides a summary of the first
# 50 violations for both Setup and Hold analysis. So a maximum number of
# 100 paths are reported.
#
#------------------------------------------------------------------------
proc reportCriticalPaths { fileName } {
# Open the specified output file in write mode
    set FH [open $fileName w]
    # Write the current date and CSV format to a file header
    puts $FH "#\n# File created on [clock format [clock seconds]]\n#\n"
    puts $FH "Startpoint,Endpoint,DelayType,Slack,#Levels,#LUTs"
    # Iterate through both Min and Max delay types
    foreach delayType {max min} {
        # Collect details from the 50 worst timing paths for the current analysis
        # (max = setup/recovery, min = hold/removal)
        # The $path variable contains a Timing Path object.
        foreach path [get_timing_paths -delay_type $delayType -max_paths 50 -nworst 1] {
        # Get the LUT cells of the timing paths
        set luts [get_cells -filter {REF_NAME =~ LUT*} -of_object $path]
        # Get the startpoint of the Timing Path object
        set startpoint [get_property STARTPOINT_PIN $path]
        # Get the endpoint of the Timing Path object
        set endpoint [get_property ENDPOINT_PIN $path]
        # Get the slack on the Timing Path object
        set slack [get_property SLACK $path]
        # Get the number of logic levels between startpoint and endpoint
        set levels [get_property LOGIC_LEVELS $path]
        # Save the collected path details to the CSV file
        puts $FH "$startpoint,$endpoint,$delayType,$slack,$levels,[llength $luts]"
        }
    }
    # Close the output file
    close $FH
    puts "CSV file $fileName has been created.\n"
    return 0
}; # End PROC



=======================================================================
======================================================================
FPGA
=====
Write an RTL block that implements 4x4 matrix multiplication. Then they asked me to optimize the critical path. Then: what if the elements are floating-point numbers?
Then check whether we can reduce the chip area by using fewer multipliers.
How do you optimize your ASIC/FPGA design?
What is speed grade and how do you select an FPGA as per requirements?
Speed grade determines the maximum clock rate at which the FPGA can run. Vendors use different numbering; values such as -1, -2 indicate the scale. The higher the grade, the higher the cost of the FPGA.
What is the maximum possible speed achievable for a given device, say a Virtex-6 device (some speed grade)?
The Fmax is determined by flop-to-flop timing using the shortest route (within a CLB) with the least clock skew. To put it simply, this is usually calculated based on the logic levels between the source and the destination of a path.
What is a logic level?
The number of logic levels in your combinational circuit, i.e. the number of logic stages (such as LUTs) a signal passes through between two registers.
How do you code to reduce power in FPGA design
Avoid reset for FPGA
Clock Gate
Use synchronous design
Avoid overconstraining
Reduce Device temperature (cooling solution)
Use clk_en and control enable for all Memory
Use LUT for smaller memory. BRAM takes more power
What is clock gating and how do you do it in FPGA
Xilinx and Altera provides primitive components for this. Check them
How do you manage multiple clocks and how do you route them
CDC tools such as SpyGlass can help with this. But asynchronous transfers must be handled carefully in the design, and later they can be assigned as false paths so the tool can go easy on them during compilation.
How do you do I/O planning, and can you explain a few I/O types?
How do you add debug probes for ChipScope, and what are the benefits of inserting them versus inferring them?
Inferring them along with your RTL gives you the freedom to debug whenever required. Adding them as probes afterwards requires you to sort through the netlist file and rack your brain finding the net names.
Sometimes some logic needs to be always put on JTAG for status/debug.
How do you program multiple FPGA images in flash and load?
Check google on this
How do you control reset logic
For FPGA, global reset is sufficient. Use async reset for internal logic and sync that reset in main clk (if they are in same clock, else use accordingly)
Always assert the reset asynchronously and de-assert synchronously with clock
How do you do STA
What are the timing constraints that you do for a typical design having synchronous and asynchronous logic.
For async, false path or TIG constraints. We do not do any constraint for async path if it is taken care in RTL
How does the RTL logic convert to logic gate, say a comparator or counter, please describe
Can you write a code in RTL for a debounce logic or 9-bit counter?
Some questions about RTL coding for FPGA primitive components, what are the primitive components and what have you used.
BUFGMUX, IBUF, OBUF, etc.
How can we find the max clock for a given FPGA program?
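
One common way to answer this after implementation is to take the constrained clock period and the worst setup slack from the timing report and compute Fmax = 1 / (period - slack). The sketch below only illustrates that arithmetic; the function name and the example numbers are hypothetical and do not come from any real report.

```python
def fmax_from_slack_mhz(clock_period_ns, worst_setup_slack_ns):
    """Estimate Fmax in MHz from the constrained period and worst setup slack.

    Positive slack means the design could run faster than constrained;
    negative slack means it fails timing at the constrained period.
    """
    achievable_period_ns = clock_period_ns - worst_setup_slack_ns
    if achievable_period_ns <= 0:
        raise ValueError("slack must be smaller than the clock period")
    return 1000.0 / achievable_period_ns


if __name__ == "__main__":
    # Hypothetical: 10 ns constraint (100 MHz) with +1.5 ns of setup slack.
    print(round(fmax_from_slack_mhz(10.0, 1.5), 2), "MHz")
    # Hypothetical: same constraint, but failing timing by 0.5 ns.
    print(round(fmax_from_slack_mhz(10.0, -0.5), 2), "MHz")
```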

FPGA Emulation
------------------

FPGA architecture, CLB, LUT, slices, BRAM, distributed RAM, clock buffers and clock resources, synthesis, placement, routing, timing analysis, constraints for FPGA, setup and hold violations, equations and examples for setup, hold, slack and skew, positive slack, negative slack, equations for positive as well as negative slack.

Fixing the violations, STA, DTA, synchronizer, metastability, CDC, Vivado tool flow, DCP, DRP, Xilinx ISE flow, migration of ISE to Vivado, FPGA I/O resources, high-speed interfaces, MIG, DMA, AXI-DMA, Zynq SoC, ARM Cortex, Vivado reports vs Xilinx ISE reports, PCIe, DDR3, DDR4, synchronous reset vs asynchronous reset, synchronous memories vs asynchronous memories, ChipScope/ILA, MMCM vs PLL, DCM, BUFG, BUFGCTRL, IBUF, BUFR, GTX transceiver, eye diagram debug, baud rate, shell, Tcl, Perl, Python

FPGA -rtos in fpga,interrupts,bsp,boot soc,sdk,application program,Vivado HLS,RTL debug,integration methods,LINTING

Protocols: AXI, APB, AXI-APB, AXI-Lite, AXI-Stream, USB 2.0, migration from USB 2.0 to 3.0, I2C, SPI, CAN

BUS-PCIe internals,DDR3 internals
=======================================================================================================================

1. fpga architecture
2. VHDL /verilog code.
3. interfaces (such as PCIE, DDR3..)
4. embedded ARM

What do you mean by JTAG or USB?
JTAG works through USB or the parallel port.
I mean you will use JTAG anyway, so the question should be USB or parallel.
In this case I think if you want to design your own development board, use parallel;
it would be easier.
Anyway, search the Altera site for ByteBlaster and you will see
a schematic of the JTAG cable, so you'll be able to just duplicate it.

=======================================================================================================================


1. What is Metastability and what are its effects?
2. What is setup and hold time?
3. What is pipelining?
4. Why do FPGAs have dedicated clock pins? What's so special about them?
5. Describe a design that you think that will best fit an FPGA.
6. Describe a design that you think that will best fit a Microcontroller.
7. What are the difficulties in asynchronous FIFO design?
========================================================================

For example: convert 14 into hex, binary, and octal (about third of candidates cannot do that)
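
A quick way to check the answer to this kind of question is to let a scripting language do the conversion. The snippet below is a small sketch using Python's built-in `format`; the helper name `to_bases` is made up for illustration.

```python
def to_bases(value):
    """Return the hexadecimal, binary and octal digit strings of an integer."""
    return {
        "hex": format(value, "x"),
        "binary": format(value, "b"),
        "octal": format(value, "o"),
    }


if __name__ == "__main__":
    # 14 = 8 + 4 + 2, so binary 1110; one hex digit E; octal 1*8 + 6.
    for base, digits in to_bases(14).items():
        print(base, digits)
```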

- Understanding of digital logic: setup and hold, pipelines, latency and throughput,  etc.

- FPGA architecture questions: how different circuits are synthesized, process of timing closure,
 differences between FPGA and ASICs, clock and reset schemes

- Good knowledge of hardware description language (SystemVerilog)

A question about clock domain crossing is typical in an FPGA interview. For example, a signal derived in clock domain A is used in a sequential
 process in clock domain B. What can potentially happen if you use the signal directly in clock domain B? 
What is the proper way to handle the signal before using it in clock domain B?


=====MAX FREQ OF THE CODE ----

An FPGA can be programmed with HDL languages like VHDL and Verilog. So, to be frank, a program by itself can run at any frequency with no limitations at all. However, when it comes to a particular FPGA, the program gets limited by that FPGA's architecture.

For example

A small program, say a timer, can run flawlessly at a 400 MHz clock on a modern-day Virtex-7-like device, just like that… But running the same on a decade-old Spartan-3 is not at all possible. The program I’m referring to is a “Timer”, which could be as simple as that. The limiting factor here is not just the program but the technology. Now let’s rewrite the Timer code with logic levels at 1. The Spartan-3 can now reach a max clock of 200 MHz, and that would be the max limit of that FPGA.

→ Key point here is, every FPGA device or any silicon has a physical limit in conducting electricity across one node to other (Node is your FF) and the delay between that two determines the max possible frequency.

Now let’s take another program called “memory controller”. This program is complex, so achieving 200 MHz on Virtex-7 is hard and you get 190 MHz, whereas on Spartan-3 you get 100 MHz if written well... not a big difference, eh? This is mainly because as the number of logic levels increases, the time needed to get through them within one clock increases proportionally. Conversely, by keeping strictly to 1 logic level throughout your code, you can achieve 500+ MHz for your design on devices like Virtex-7. Amazing? But wait! Did I mention that the same will eat up all your resources? So the designer must keep that in mind and keep the logic levels low, around 1–3, which helps achieve good speeds.
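
The relation between logic levels and clock speed can be sketched with a simple register-to-register delay model: minimum period = clock-to-Q + (logic levels × delay per level) + setup time. All delay values below are invented round numbers chosen only to show the trend; they are not taken from any datasheet, and the function name is hypothetical.

```python
def estimate_fmax_mhz(logic_levels, t_clk_to_q_ns=0.5, t_level_ns=1.0, t_setup_ns=0.3):
    """Rough Fmax estimate for a flop-to-flop path with the given logic depth.

    t_level_ns lumps LUT delay and routing delay per level (an assumption).
    Clock skew is ignored in this simplified model.
    """
    min_period_ns = t_clk_to_q_ns + logic_levels * t_level_ns + t_setup_ns
    return 1000.0 / min_period_ns


if __name__ == "__main__":
    for levels in (1, 2, 3, 6, 10):
        print(levels, "logic levels ->", round(estimate_fmax_mhz(levels), 1), "MHz")
```

With these made-up delays, going from 1 to 3 logic levels roughly halves the achievable frequency, which is the trade-off the paragraph above describes.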

So I would say, that a program can be written

To work with best possible speeds
To work with worst possible speeds as well
So what is my limiting factor that prevents my program to execute at MAX frequency?

Type of FPGA and it’s silicon technology
The Slices interconnect wire length and hence the fab tech that was used
The ability of the tools that compile/synthesize/PAR your design. Synopsys or Cadence tools will show you how much better they are compared with Xilinx or Altera
Finally the ability of the design engineer who writes HDL code
So, for your question my answer is → NO, you can’t find the max frequency from HDL code alone; theoretically the code can run at any frequency. Still don’t agree? Try your simulation and provide an input clock period of 1 ns, which is 1 GHz.

Your major limiting factor here is the Silicon chip that you are targeting.


------

Why can't we use a clock source directly rather than using FF divider circuits, which also need a clock source for their operation?

Is it so? If so my answer is below,

First thing is to reduce the number of clock sources in a system or a design.
 Using a single clock source we can produce multiple clock frequencies with the help of a frequency divider.
 For instance, if a processor needs a 48 MHz XTAL clock for its core operation whereas other devices like SRAM and
 additional peripherals work at different frequencies below 12 MHz, then using FFs
we can easily get the required frequencies without using separate XTALs for each peripheral.
Second thing is, by doing so we can avoid timing issues like rise/fall, hold,
jitter and so on. (Using multiple clock sources may cause these issues.)
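
As a small illustration of the divider idea, the sketch below counts how many cascaded divide-by-2 flip-flops are needed to get from one clock to another and models a single toggle flip-flop. It assumes a plain toggle-FF chain (power-of-two ratios only); the 6 MHz and 3 MHz targets and the function names are hypothetical examples, not values from the text.

```python
def divider_stages(source_hz, target_hz):
    """Number of cascaded divide-by-2 flip-flops between source and target clock."""
    ratio, remainder = divmod(source_hz, target_hz)
    if remainder or ratio < 1 or ratio & (ratio - 1):
        raise ValueError("a plain toggle-FF chain needs a power-of-two ratio")
    return ratio.bit_length() - 1


def toggle_divider(num_rising_edges):
    """Model one divide-by-2 flip-flop: the output toggles on every rising input edge."""
    q = 0
    outputs = []
    for _ in range(num_rising_edges):
        q ^= 1
        outputs.append(q)
    return outputs


if __name__ == "__main__":
    print(divider_stages(48_000_000, 6_000_000), "stages for 48 MHz -> 6 MHz")
    print(divider_stages(48_000_000, 3_000_000), "stages for 48 MHz -> 3 MHz")
    # The output repeats every two input edges, i.e. half the input frequency.
    print(toggle_divider(6))
```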
========================================================================

verify Rtl design using FPGA
============================

First an RTL design needs to be verified using a simulator. We call it this way

Verify = Simulation in EDA tools

Validation = test on board using debugger, etc

When you say verify, you mean the functionality, which is verified using simulator tools first. Then come the other stages of compilation. When you say validate, you mean functional verification on the board, probably running on an FPGA. There are many ways to validate in hardware; some are:

Use JTAG to capture your internal registers or IOs
Use VIO core (say from Xilinx FPGA) and drive some set of internal registers to do DFT style testing
Use a debugger module to communicate through UART or so and run various internal test cases on board (like running in simulation) through scripts. This is how modern day ASICs are verified (but a bit complex).
With devices like FPGA, time consuming gate level simulations have been aged up. These days people do

Functional simulation and do coverage
synthesis and verify netlist
Check timing and area
Do PAR and check for DRC issues and others
Go for validation

Simple RTL syntax of your choice

Test bench structure

What is verification? Ways of verifying things.

Internal workings of FPGAs such as block RAM, DSP, logic cell, LUT

What is the most basic way of meeting timing, or checking whether you are meeting it? (hint: clock constraints)

Implement some digital logic

You don't necessarily need to know what a hash table is or how to implement an FIR filter.

Others have listed very specific skills that are useful but unlikely to be asked about during an interview for an entry-level position. More general skills are very important:

A firm grasp of digital logic, basically everything covered in digital and advanced digital design courses

Math - (DSP is a bonus but just solid algebra and signals and systems knowledge). Being familiar with fixed point arithmetic is a huge boon.

EE knowledge - Understanding the electrical properties of both the FPGA and the devices it interfaces to is a huge plus.

Scripting - Scripting is used to automate many tasks from builds to testbenches to analysis. TCL is widely used by the FPGA Tools and Python is just about everywhere for the rest.

Understanding Trade Offs - Accuracy for speed for example comes up all the time in FPGA designs.

What kind of toolchains, IDEs and software do you use for your own projects and work projects?
What is the biggest pain you experience?
What is the top feature/tool/thing you wish you had/could use that would make your life so much easier when designing with HDL?
How do I get this ripple Moore multiplier to work?
I am looking for an FPGA (IC, not dev board) for a small project involving HDMI signal processing. I have been looking for an FPGA with the following specifications:

Enough transceiver capacity and I/O for HDMI input and output (doesn't need to be full HD but I'd like at least 720p 30 Hz)

Support for manufacturer's HDMI IP (don't want to reinvent the wheel)

Supported by free version of Quartus/Vivado/etc (not critical, but I would prefer not having to be restricted to the trial period of whatever software I choose) (apparently the trial does not support generating the programming file so that's not an option)

Ideally not BGA, as it would be harder to design the board and especially solder it by hand

# Need advice: Altera DE0 from Digilent or Altera DE10 from Terasic


========================================================================
race around ----


In a JK flip-flop, a condition called race-around occurs when both J and K are 1.

In the race-around condition, for as long as the clock is high the output keeps switching from 0 to 1 and from 1 to 0. This is undesirable because the change in output is uncontrolled. However, the change is useful if we can control it. To avoid the race-around condition, or rather to turn it into clean toggling, we use the master-slave configuration, in which the output changes only when a negative clock edge occurs. This controlled change in output helps in constructing counters, which rely on the toggling condition of the JK flip-flop. In short, the master-slave configuration converts the race-around condition of a JK flip-flop into a toggling condition.

Let us understand the mechanism of edge triggering in this case. The master-slave configuration actually has two JK flip-flops, one of which has its clock inverted with respect to the other. The two flip-flops are named MASTER and SLAVE. While the clock of the master is high the master is active, and during this time the clock of the slave is low, which keeps the slave in its memory condition. Let our output be Q = 1 and Q' = 0, with J = K = 1, and consider one clock pulse. While the pulse is high the slave remains in the memory state, so the output will not change as long as the clock is high. The master, however, is active during this interval (i.e. the interval during which the pulse remains high): its J side receives Q' = 0 fed back from the slave and its K side receives Q = 1, so the master resets and presents 0 and 1 to the inputs of the slave. When the clock pulse goes low the master becomes inactive, i.e. enters the memory state, and the slave becomes active; with inputs 0 and 1 it produces Q = 0 and Q' = 1. So the output appears to change when the negative edge arrives, which is why master-slave flip-flops are called negative-edge-triggered; there is no special mechanism behind it. Also, the output of this flip-flop changes only once per clock pulse.

If you know the reason for using JK over the SR flip-flop,

then the answer to this question would be:

In a JK flip-flop the feedback changes more than once in a single clock pulse, whereas in a master-slave JK flip-flop
the feedback holds a constant value, which stops the output from changing more than once and hence removes the racing between 0 and 1 (race-around) condition.
It also becomes independent of the propagation delay.
Either the master works or the slave works at a time, not both simultaneously.
The slave always follows the action of the master. The master changes its output in one half cycle and the slave makes the same change in the next half. So one of them is inactive at any time, preventing the serious problem of race-around.
Edge triggering
Regarding edge triggering, fine, it can be done that way, and some such devices are available. What is required is a CR circuit with a small time constant to convert the clock levels into fast edges.
A master-slave FF overcomes the race-around condition by connecting two identical circuits together, one for working and the other for delay. What happens in the race-around condition is that you can't predict the output even though you know the previous state, because it changes so fast; the slave in the master-slave arrangement holds the circuit, i.e. provides some delay. This is how I understand it. The proper working is already explained in some answers by others.

We can use edge triggering in these conditions, but the master-slave circuit is often preferred because in an edge-triggered JK flip-flop there is a chance of glitches occurring in the output, which does not happen in master-slave.

When we stop providing the clock signal, there won't be any further toggling process but we won't be able to determine the output. It could be either 1 or 0.

This is the race-around condition. It can be eliminated by implementing a master-slave JK flip-flop, which has two stages. Since the clock signal is inverted before it is provided to the slave stage, only one of the two stages is active at any instant.

So, as a result, whenever the output is generated, it is not immediately fed back as input to the input side, but it remains in between the two stages, that is because the slave-stage will be inactive. This makes the output predictable.

Steps to avoid the racing condition in a JK flip-flop:

If the clock ON (high) time is less than the propagation delay of the flip-flop, racing can be avoided. This is done by using edge triggering rather than level triggering.
If the flip-flop is made to toggle over one clock period, racing can be avoided. This introduced the concept of the master-slave JK flip-flop.
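To make the difference concrete, here is a small behavioural sketch in Python. It is only a model under simplifying assumptions: the level-triggered flip-flop is assumed to re-evaluate its output once per propagation delay while the clock is high, and the time units and function names are invented for the illustration.

```python
def jk_next(q, j, k):
    """Characteristic equation of a JK flip-flop: Q+ = J.Q' + K'.Q"""
    return (j & (1 - q)) | ((1 - k) & q)


def level_triggered_jk(q, j, k, high_time, prop_delay):
    """Level-triggered JK: the fed-back output is re-evaluated after every
    propagation delay for as long as the clock stays high (race-around)."""
    t = 0
    while t < high_time:
        q = jk_next(q, j, k)
        t += prop_delay
    return q


def master_slave_jk(q, j, k):
    """One full clock pulse of a master-slave JK.
    Clock high: the master evaluates once, using the frozen slave output q.
    Clock low: the slave copies the master, so Q changes once per pulse."""
    master = jk_next(q, j, k)
    return master
```

With J = K = 1, a high time of 50 and a delay of 10 give five toggles, while a high time of 40 gives four, so the final level-triggered output depends on the pulse width. The master-slave model toggles exactly once per pulse, and a high time shorter than the propagation delay also yields a single toggle, which matches the first step above.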

========================================================================
http://www.design4silicon.com/2016/05/fpga-interview-questions-and-answers.html

========================================================================

Digital design (many topics within digital design),
CMOS circuit design & underlying concepts (sizing, chain of inverters,
logical effort, combinational & sequential logic design, drive strength,
power considerations & repercussions, derating for PVT…),
FPGAs, board-related concepts, HDL coding (SystemVerilog, Verilog, VHDL),
good HDL coding practices, synthesis concepts, ASIC design concepts,
verification methodologies, basic serial communication protocols & their concepts,
basic & advanced bus theory & knowledge of prominent bus protocols (AMBA - APB / AHB / AXI …).
Working-level knowledge of a particular protocol or domain - storage (SSDs / SATA)
/ communication (Ethernet / USB …) / memory interfaces / DSP knowledge.
Knowledge of computer architecture (many sub-topics within this topic)
is central to all the concepts involved, as everything is connected to a processor core.
Good working-level knowledge of verification methodologies - UVM / OVM.
These are just the technical skills; besides these, you need to know how to use the tools and be hands-on with the Unix environment / scripting for automation of tasks.
There are sooooo many VLSI blogs, it’s easy to get lost. Chalk out your study plan,
start with the basics, and ask a senior manager / colleague in the domain of your interest
to mentor / guide you with your career plan.


1. Start with the Samir Palnitkar book or J. Bhasker; design small systems like gates, counters etc.

2. Improve your design skills by reading the code of projects on OpenCores; use a good editor for understanding the code.

3. Learn to use the VIO and ILA IP cores and improve your debugging skills.

4. With the help of the Vivado videos on the Xilinx website, improve your placement and routing skills.

Practice makes you perfect … design new subsystems and interact actively in the Xilinx forum.

Hope this will help you get started .

========================================================================
1) What is the minimum and maximum frequency of the DCM in Spartan-3 series FPGAs?

Spartan series DCMs have a minimum frequency of 24 MHz and a maximum of 248 MHz.

2) Tell me some of the constraints you used and their purpose during your design?

There are a lot of constraints and they vary from tool to tool; I am listing some of the Xilinx constraints.
a) translate_off and translate_on: the Verilog code between translate_off and translate_on is ignored for synthesis.
b) CLOCK_SIGNAL: is a synthesis constraint. In the case where a clock signal goes through combinatorial logic before being connected to the clock input of a flip-flop, XST cannot identify which input pin or internal net is the real clock signal. This constraint allows you to define the clock net.
c) XOR_COLLAPSE: is a synthesis constraint. It controls whether cascaded XORs should be collapsed into a single XOR.
For a detailed description of more constraints, refer to the constraints guide.

3) Suppose for a piece of code the equivalent gate count is 600 and for another code the equivalent gate count is 50,000, will the size of the bitmap (bitstream) change? In other words, will the size of the bitmap change if the gate count changes?

The size of the bitmap is independent of resource utilization; it is always the same for a given device. For the Spartan xc3s5000 it is 1.56MB and will never change.

4) What are the different types of FPGA programming modes? Which are you currently using? How do you change from one to another?

Before powering on the FPGA, configuration data is stored externally in a PROM or some other nonvolatile medium either on or off the board. After applying power, the configuration data is written to the FPGA using any of five different modes: Master Parallel, Slave Parallel, Master Serial, Slave Serial, and Boundary Scan (JTAG). The Master and Slave Parallel modes
The mode-select pins can be set to choose the mode; refer to the data sheet for further details.

5) Tell me some of features of FPGA you are currently using?

I am taking the xc3s5000 as an example to answer the question.

Very low cost, high-performance logic solution for
high-volume, consumer-oriented applications
- Densities as high as 74,880 logic cells
- Up to 784 I/O pins
- 622 Mb/s data transfer rate per I/O
- 18 single-ended signal standards
- 6 differential I/O standards including LVDS, RSDS
- Termination by Digitally Controlled Impedance
- Signal swing ranging from 1.14V to 3.45V
- Double Data Rate (DDR) support
• Logic resources
- Abundant logic cells with shift register capability
- Wide multiplexers
- Fast look-ahead carry logic
- Dedicated 18 x 18 multipliers
- Up to 1,872 Kbits of total block RAM
- Up to 520 Kbits of total distributed RAM
• Digital Clock Manager (up to four DCMs)
- Clock skew elimination
• Eight global clock lines and abundant routing

6) What is gate count of your project?

Well mine was 3.2 million, I don’t know yours.!

7) Can you list some synthesizable and non-synthesizable constructs?

Not synthesizable:
initial: ignored for synthesis.
delays: ignored for synthesis.
events: not supported.
real: real data type not supported.
time: time data type not supported.
force and release: force and release of data types not supported.
fork join: use nonblocking assignments to get the same effect.
user defined primitives: only gate-level primitives are supported.

Synthesizable constructs:
assign, for loop, gate-level primitives, repeat with constant value...

8) Can you explain what stuck-at-zero means?

Stuck-at problems appear in ASICs. Sometimes a node is permanently tied to 1 or 0 because of some fault. To catch that, we need to provide testability in the RTL. If the node is permanently 1 it is called stuck-at-1; if it is permanently 0 it is called stuck-at-0.
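A toy fault simulation may help here. The sketch below is in Python and uses a made-up three-input circuit, y = (a AND b) OR c, with a fault injected on the internal node n = a AND b. It lists the input vectors whose output differs between the good and the faulty circuit, which are the vectors that detect the fault.

```python
from itertools import product


def circuit(a, b, c, stuck=None):
    """y = (a AND b) OR c. If 'stuck' is 0 or 1, internal node n is forced to it."""
    n = a & b
    if stuck is not None:
        n = stuck  # inject the stuck-at fault on node n
    return n | c


def detecting_vectors(stuck):
    """Input vectors for which the faulty output differs from the good output."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, stuck=stuck)]


sa0_tests = detecting_vectors(0)  # tests for n stuck-at-0
sa1_tests = detecting_vectors(1)  # tests for n stuck-at-1
```

For this circuit only (1, 1, 0) exposes the stuck-at-0 fault: the node must be driven to 1 and c must be 0 so the fault propagates to the output. Scan chains and ATPG tools automate this kind of search on real designs.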

9) Can you draw general structure of fpga?

10) Difference between FPGA and CPLD?

FPGA:
a)SRAM based technology.
b)Segmented connection between elements.
c)Usually used for complex logic circuits.
d)Must be reconfigured at every power-up, because the SRAM configuration is lost when the power is off.
e)Costly

CPLD:
a)Flash or EPROM based technology.
b)Continuous connection between elements.
c)Usually used for simpler or moderately complex logic circuits.
d)Need not be reprogrammed once the power is off.
e)Cheaper

11) What are DCMs? Why are they used?

Digital clock manager (DCM) is a fully digital control system that
uses feedback to maintain clock signal characteristics with a
high degree of precision despite normal variations in operating
temperature and voltage.
That is, the clock output of a DCM is stable over a wide range of temperature and voltage, the skew associated with a DCM is minimal, and all phases of the input clock can be obtained. The output of the DCM coming from a global buffer can handle more load.


12) FPGA design flow?

13) What is a slice, CLB, LUT?

I am taking the xc3s500 as an example to answer this question.

The Configurable Logic Blocks (CLBs) constitute the main logic resource for implementing synchronous as well as combinatorial circuits.
CLBs can be configured as combinational logic, RAM or ROM depending on the coding style.
A CLB consists of 4 slices, and each slice consists of two 4-input LUTs (look-up tables), the F-LUT and the G-LUT.
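One way to picture a 4-input LUT is as a 16-bit truth table addressed by its inputs. The Python sketch below assumes that view; the function names and the bit ordering (input a as the least significant address bit) are choices made for this illustration, not a vendor definition.

```python
def make_lut4(func):
    """Build the 16-bit init value of a 4-input LUT from a Boolean function."""
    init = 0
    for addr in range(16):
        bits = [(addr >> i) & 1 for i in range(4)]  # bits[0] is input a
        if func(*bits):
            init |= 1 << addr
    return init


def lut4_read(init, a, b, c, d):
    """Look up the output: the four inputs form the address into the table."""
    addr = a | (b << 1) | (c << 2) | (d << 3)
    return (init >> addr) & 1


xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)  # 4-input parity
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)  # 4-input AND
```

Any function of four inputs costs the same single LUT, because only the stored 16 bits differ. The same storage is what lets a LUT double as a small 16x1 RAM or ROM.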

14) Can a CLB be configured as RAM?

YES.

The memory assignment is a clocked behavioral assignment, Reads from the memory are asynchronous, And all the address lines are shared by the read and write statements.

15) What is the purpose of a constraint file and what is its extension?

The UCF file is an ASCII file specifying constraints on the logical design. You create this file and enter your constraints in it with a text editor. You can also use the Xilinx Constraints Editor to create constraints within a UCF file (.ucf is the extension). These constraints affect how the logical design is implemented in the target device. You can use the file to override constraints specified during design entry.

16) What is FPGA you are currently using and some of main reasons for choosing it?

17) Draw a rough diagram of how clock is routed through out FPGA?



18) How many global buffers are there in your current FPGA, and what is their significance?

There are 8 of them in xc3s5000
An external clock source enters the FPGA using a Global Clock Input Buffer (IBUFG), which directly accesses the global clock network or an Input Buffer (IBUF). Clock signals within the FPGA drive a global clock net using a Global Clock Multiplexer Buffer (BUFGMUX). The global clock net connects directly to the CLKIN input.

19) What is the frequency of operation and equivalent gate count of your project?

20)Tell me some of timing constraints you have used?

21)Why is map-timing option used?

Timing-driven packing and placement is recommended to improve design performance, timing, and packing for highly utilized designs.

22)What are different types of timing verifications?

Dynamic timing:
a. The design is simulated in full timing mode.
b. Not all possibilities tested as it is dependent on the input test vectors.
c. Simulations in full timing mode are slow and require a lot of memory.
d. Best method to check asynchronous interfaces or interfaces between different timing domains.
Static timing:
a. The delays over all paths are added up.
b. All possibilities, including false paths, verified without the need for test vectors.
c. Much faster than simulations, hours as opposed to days.
d. Not good with asynchronous interfaces or interfaces between different timing domains.
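The "delays over all paths are added up" idea can be sketched in a few lines of Python. The netlist below is entirely hypothetical (node names and picosecond delays are invented), and real static timing analysis also handles clock skew, hold checks and false paths, which this sketch ignores.

```python
from functools import lru_cache

# Hypothetical netlist: node -> (delay in ps, list of fan-in nodes)
NETLIST = {
    "reg_a.Q": (500, []),                     # clock-to-Q delay of source register
    "reg_b.Q": (500, []),
    "and1":    (800, ["reg_a.Q", "reg_b.Q"]),
    "add1":    (2100, ["and1", "reg_b.Q"]),
    "mux1":    (600, ["add1", "reg_a.Q"]),    # drives the D input of the next register
}
SETUP_PS = 400  # assumed setup time of the capturing register


@lru_cache(maxsize=None)
def arrival(node):
    """Latest arrival time at a node: own delay plus the slowest fan-in."""
    delay, fanin = NETLIST[node]
    return delay + max((arrival(n) for n in fanin), default=0)


def min_period_ps(endpoint):
    """Smallest clock period that still meets setup at the endpoint."""
    return arrival(endpoint) + SETUP_PS


period = min_period_ps("mux1")
fmax_mhz = 1e6 / period  # 1e6 ps per microsecond, so this is in MHz
```

No test vectors are needed: every path is covered by walking the graph once, which is why static timing is so much faster than full-timing simulation.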

23) Compare PLL & DLL ?

PLL:
PLLs have disadvantages that make their use in high-speed designs problematic, particularly when both high performance and high reliability are required.
The PLL voltage-controlled oscillator (VCO) is the greatest source of problems. Variations in temperature, supply voltage, and manufacturing process affect the stability and operating performance of PLLs.

DLLs, however, are immune to these problems. A DLL in its simplest form inserts a variable delay line between the external clock and the internal clock. The clock tree distributes the clock to all registers and then back to the feedback pin of the DLL.
The control circuit of the DLL adjusts the delays so that the rising edges of the feedback clock align with the input clock. Once the edges of the clocks are aligned, the DLL is locked, and both the input buffer delay and the clock skew are reduced to zero.
Advantages:
· precision
· stability
· power management
· noise sensitivity
· jitter performance.


24) Given two ASICs. one has setup violation and the other has hold violation. how can they be made to work together without modifying the design?

Slow the clock down on the one with setup violations..
And add redundant logic in the path where you have hold violations.

25)Suggest some ways to increase clock frequency?

· Check critical path and optimize it.
· Add more timing constraints (over constrain).
· pipeline the architecture to the max possible extent keeping in mind latency req's.

26)What is the purpose of DRC?

DRC (design rule check) is used to check whether the layout corresponding to a particular schematic (especially the mask sets involved) conforms to a pre-defined rule set that depends on the technology used for the design. The rules are parameters set by the semiconductor manufacturer for how the masks should be placed, connected and routed, so that variations in the fab process do not affect normal functionality. They usually denote the minimum allowable configuration.

27) What is LVS and why do we do it? What is the difference between LVS and DRC?

The layout must be drawn according to certain strict design rules. DRC helps in the layout of designs by checking whether the layout abides by those rules.
After the layout is complete we extract the netlist. LVS compares the netlist extracted from the layout with the schematic to ensure that the layout is an identical match to the cell schematic.

28)What is DFT ?

DFT means design for testability. 'Design for Test or Testability' is a methodology that makes it possible to check that a design works properly after manufacturing, and it later facilitates failure analysis and the detection of faulty products/pieces.
Other than the functional logic, you need to add some DFT logic to your design. This will help you test the chip for manufacturing defects after it comes back from the fab. Scan, MBIST, LBIST, IDDQ testing etc. are all part of this. (This is a hot field with lots of opportunities.)

29) There are two major FPGA companies: Xilinx and Altera. Xilinx tends to promote its hard processor cores and Altera tends to promote its soft processor cores. What is the difference between a hard processor core and a soft processor core?

A hard processor core is a pre-designed block that is embedded onto the device. In the Xilinx Virtex II-Pro, some of the logic blocks have been removed, and the space that was used for these logic blocks is used to implement a processor. The Altera Nios, on the other hand, is a design that can be compiled to the normal FPGA logic.

30)What is the significance of contamination delay in sequential circuit timing?


31)When are DFT and Formal verification used?

DFT:
· manufacturing defects like stuck at "0" or "1".
· test for set of rules followed during the initial design stage.

Formal verification:
· Verification of the operation of the design, i.e, to see if the design follows spec.
· gate netlist == RTL ?
· using mathematics and statistical analysis to check for equivalence.

32)What is Synthesis?

Synthesis is the stage in the design flow which is concerned with translating your Verilog code into gates - and that's putting it very simply! First of all, the Verilog must be written in a particular way for the synthesis tool that you are using. Of course, a synthesis tool doesn't actually produce gates - it will output a netlist of the design that you have synthesised that represents the chip which can be fabricated through an ASIC or FPGA vendor.

33)We need to sample an input or output something at different rates, but I need to vary the rate? What's  a clean way to do this?

Many, many problems have this sort of variable-rate requirement, yet we are usually constrained to a constant clock frequency. One trick is to implement a digital NCO (Numerically Controlled Oscillator). An NCO is actually very simple and, while it is most naturally understood as hardware, it can also be constructed in software. The NCO, quite simply, is an accumulator to which you keep adding a fixed value on every clock (i.e. at a constant clock frequency). When the NCO "wraps", you sample your input or do your action. By adjusting the value added to the accumulator each clock, you finely tune the AVERAGE frequency of that wrap event. Now, you may have realized that the wrap event may have lots of jitter on it. True, but you may use the wrap to increment yet another counter, where each additional divide-by-2 bit reduces this jitter. The DDS is a related technique. I have two examples showing both an NCO and a DDS in my File Archive. This is tricky to grasp at first, but tremendously powerful once you have it in your bag of tricks. NCOs also relate to digital PLLs, timing recovery, TDMA and other "variable rate" phenomena.
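Since the NCO can also be built in software, here is a minimal Python sketch of the accumulator described above. The function name, the 8-bit width and the step values are assumptions picked for the illustration; the average wrap rate works out to f_clk * step / 2^width.

```python
def nco_ticks(step, width, n_clocks):
    """Return the clock indices at which a width-bit phase accumulator wraps."""
    modulus = 1 << width
    acc = 0
    ticks = []
    for clk in range(n_clocks):
        acc += step               # add the fixed value on every clock
        if acc >= modulus:        # carry out of the accumulator = wrap event
            acc -= modulus
            ticks.append(clk)     # sample the input / do the action here
    return ticks


even = nco_ticks(step=64, width=8, n_clocks=16)     # exactly clock / 4
uneven = nco_ticks(step=100, width=8, n_clocks=256) # average clock * 100 / 256
gaps = {b - a for a, b in zip(uneven, uneven[1:])}  # spacing between wraps
```

With step 100 the accumulator wraps 100 times in 256 clocks, so the average rate is exact, but the spacing alternates between 2 and 3 clocks. That alternation is the jitter mentioned above.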
==========================================================================================================================

intel
======

An hour on semiconductor physics, an hour on digital logic design, an hour on process technology, an hour on CPUs and computers
 (I was interviewing for a CPU design job.)

========================================================================

What is a CLB?

CLB refers to the “Configurable Logic Block”s in Xilinx FPGAs

Spartan3 CLB:
4 Slices, each slice has 2 LUTs, 2 FFs
LUTs have 4 inputs

Spartan6 CLB:
2 Slices, each slice has 4 LUTs, 8 FFs
LUTs have 6 inputs. These basic slices are called SliceX. Some slices also have wide multiplexers and carry logic, and they are called SliceLs. And some slices also have the capability to use LUTs as distributed RAMs, and also as variable-length shift registers. They are called SliceMs.

Virtex6 CLB:
Same as Spartan6. But basic slices are SliceLs. And it also has SliceMs.

7 Series CLB:
LUTs can be configured as 1 6-input LUT, or 2 5-input LUTs. Each slice has 4 6-input LUTs, and 8 FFs. Each CLB has 2 slices. 2/3 of slices are SliceLs, others are SliceMs.

Ultrascale CLB:
Each CLB has one slice, but 2 slices of the 7 series is combined into one cohesive slice. So, each slice has 8 LUTs, and 16 FFs.

Altera has ALM instead of CLB
ALM: Adaptive Logic Module
It has 8 inputs for its LUT, which can implement a full 6-input LUT, or 7 input functions. Each ALM has one of these LUTs, and 2 FFs.

5. Clock management blocks in FPGAs

In Xilinx FPGAs:
Older families like Spartan 3 had DCMs (digital clock manager). New families have CMTs (clock management tile). CMTs have MMCMs and PLLs in them.

6. What are the transceiver types and speeds in Xilinx FPGAs?

Transceivers:
GTP: 6 Gb/s
GTX: 12.5 Gb/s
GTH: 13.1 Gb/s
GTY: 28 Gb/s on UltraScale, 32 Gb/s on UltraScale+

UltraScale+ GTR (6.0 Gb/s): Easiest integration of common protocols to the Zynq Processor Subsystem
UltraScale+ GTH (16.3 Gb/s): Low power & high performance for the toughest backplanes
UltraScale+ GTY (32.75 Gb/s): Maximum performance for the fastest optical and backplane applications; 33G transceivers for chip-to-chip, chip-to-optics, and 28G backplanes
UltraScale GTH (16.3 Gb/s): Low power & high performance for the toughest backplanes
UltraScale GTY (30.5 Gb/s): High performance for optical and backplane applications; 30G transceivers for chip-to-chip, chip-to-optics, and 28G backplanes
7 Series GTP (6.6 Gb/s): Power optimized transceiver for consumer and legacy serial standards
7 Series GTX (12.5 Gb/s): Lowest jitter and strongest equalization in a mid-range transceiver
7 Series GTH (13.1 Gb/s): Backplane and optical performance through world class jitter and equalization
7 Series GTZ (28.05 Gb/s): Highest rate, lowest jitter 28G transceiver in a 28nm FPGA

How do they work? What are the tradeoffs between a large, complex logic cell and a simple one?
Why are there hard macros in FPGAs? What are the downsides of including them?
Why is FPGA performance not usually as good as an ASIC or mask-programmed gate array? (may not even be true anymore)
What are some of the basic differences between FPGA families? Be able to talk about setup and hold times, too.
Also, this being the modern era, I'd expect questions about HDLs. If you say you know Verilog and not VHDL then they're not going to hit you with VHDL questions
for spite. But be sure to be able to talk about Verilog. Understand behavioral, RTL,
and gate-level coding styles. Understand what synthesis tools can do for you and what they can't, etc.


What happens to gate capacitance as gate oxide thickness is reduced?
Why is electron mobility higher than hole mobility? Why do they say a NAND gate is functionally complete?
Explain 1's and 2's complement arithmetic. What is the difference between a "Moore" and a "Mealy" machine?
How do you make a flip-flop from latches? Design a state machine for a traffic light. Why would you want to use a Gray code?
What is cache for? What is the difference between Harvard and von Neumann architecture?
If a read from cache takes 2 cycles but a miss takes 500 cycles, what ratio of hits must be maintained to keep reads on average to 3 cycles?
What is the difference between a const pointer to an int and a pointer to a const int?
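The cache question is plain arithmetic, and a short Python sketch shows one way to work it. It assumes the 500 cycles is the total cost of a miss (not 500 on top of the 2-cycle lookup), so the average is h*2 + (1-h)*500; the function name is just for this example.

```python
from fractions import Fraction


def required_hit_ratio(hit_cycles, miss_cycles, target_avg):
    """Solve target = h*hit + (1 - h)*miss for the hit ratio h."""
    return Fraction(miss_cycles - target_avg, miss_cycles - hit_cycles)


h = required_hit_ratio(2, 500, 3)
average = h * 2 + (1 - h) * 500  # plug h back in as a check
```

Under that assumption h = 497/498, i.e. roughly 99.8% of reads must hit. If the interviewer means the miss penalty is added to the lookup time, the same equation applies with 502 in place of 500.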

You said you are interviewing for an FPGA design position? I'd be ready to talk about FPGAs. 


Most of the questions were oriented towards timing constraint problems.
There was a question I couldn't answer, about how to make a design run at 50MHz if the synthesis and routing tools say
"I made my best effort, but it will only be able to run at 25MHz". Since I have never had to face that kind of problem (most of the modules I've made run at 10MHz tops), I told him with all honesty that I needed to research that. He told me that it's ok, and advised me to run the examples in the Xilinx application notes just to have that extra knowledge, since they face that kind of problem daily. Since I graduate in May, he told me to send him a reminder email around April so that he can give me
a more formal personal interview. What do you think?

========================================================================
CDC
=====


Clock Domain Crossing

What is Clock domain crossing ?
When a signal or a set of signals needs (for functionality, data transfer, control information transfer etc.)
to travel from one block (working in one clock domain) to another block (working in another clock domain),
a clock domain crossing of the signal(s) takes place.

Why 2 blocks may need to work on different clocks ? 
There can be different practical reasons for this, such as:

1. Inside a chip, some IP can be custom designed (all steps in the VLSI design flow already done, so we have a good working IP) to work at one particular frequency in order to meet the timing requirements of the IP. But it is quite possible that the IPs with which this IP interacts work fine on either faster or slower clocks, so they will be working at different frequencies. Clock domain crossing scenarios will arise in such a case.

2. Some IPs are bought from other companies, and these IPs are also custom designed to work at some particular frequency only.

What problems may arise due to clock domain crossing of signals ?
1. Metastability (Discussed earlier)

How to resolve issues arising because of  clock domain crossing ?
Using different types of synchronizers at the boundary.
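The most common synchronizer is the two-flop synchronizer: two flip-flops in series, both clocked by the destination clock. The Python sketch below only models its structure and latency. It cannot show metastability itself, which is an analog effect, and the class name is invented for this example.

```python
class TwoFlopSynchronizer:
    """Behavioural model: two flip-flops in series on the destination clock."""

    def __init__(self):
        self.ff1 = 0  # first stage: may go metastable in real hardware
        self.ff2 = 0  # second stage: gives ff1 a full cycle to settle

    def clock(self, async_in):
        """One rising edge of the destination clock; returns the synchronized output."""
        self.ff2, self.ff1 = self.ff1, async_in  # both flops update on the same edge
        return self.ff2


sync = TwoFlopSynchronizer()
samples = [0, 1, 1, 1, 0, 0]           # asynchronous input as seen at each edge
outputs = [sync.clock(s) for s in samples]
```

The output follows the input one edge later than a single flop would, which is the price paid for the settling time. This scheme suits single-bit control signals; multi-bit buses usually need a handshake, Gray coding or an asynchronous FIFO instead.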

========================================================================

Gated clocks in older FPGAs

A decade ago, ASICs needed clock gating to save dynamic power (not static, dynamic!!!) by disabling the clock so that the flops inside a part of the chip are not triggered when that part is not used. This posed a problem in FPGA prototyping of that ASIC because clock gating puts logic on the clock network inside the FPGA. Clocks have their own optimized low-skew routing, which cannot easily be used once logic is introduced between the clock source and the clock lines. Moreover, the system clock is no longer a single clock: several parts of the chip clocked by the system clock can be gated off. The ASIC clock tree is balanced specifically to keep these parts synchronous, but an FPGA cannot guarantee that the gated clocks stay synchronous to each other even though they have the same frequency.

Gated clocks today

Today, the complex FPGAs used for system-on-chip prototyping have these basic blocks of logic. They contain logic gates and flip-flops, the basic blocks for constructing the digital circuit you want. The flip-flops are connected to the special globally routed clock lines (dedicated to clocks, so low-skew), and the flops can be gated locally. Instead of one gater gating off the whole clock, the clock is always toggling, and locally in each logic block the gate-off can be used to disable parts of the logic while still keeping everything synchronous (because they all use the same clock).

Importance

An ASIC chip is prototyped in FPGA to allow software developers to start developing their code. But also to verify the digital circuitry with the real software. In the past, digital designers needed to change the code for FPGA to not include these gaters. In the ideal world, the ASIC code should not be changed one single bit because the FPGA must reflect the ASIC behaviour as close as possible. Today, we can avoid changing the ASIC code to prototype it on FPGA because the gating is now possible on FPGA.

You might think a small change like this, removing clock gaters, would not matter much. But every assumption you make, however small or insignificant you think it is, might turn your multi-million dollar chip into a useless thing. Especially for clocks: I have seen a chip that assumed the clock had started to run while the external oscillator was expecting to be enabled by a pin from the chip. It was before my time with that company, but the silicon was dead after reset and it took a while before they figured out what went wrong. And for every new project, all designers and project leaders were reminded of that project so that they would never assume anything. Not in simulation, not in FPGA, nowhere. A valuable lesson I always keep in mind. And something that today’s digital designers lack. They are forcing their simulation into the right mode of operation for their testcases. I do understand that some IP needs to start up and calibrate, but I see a lot of unnecessary forces (I call them laziness forces) that affect other parts of the chip in a way they can’t even understand. I know that there is no cure for stupidity, but sometimes I wonder why humans are called the intelligent race.

Conclusion

In an ASIC, clock gating is done to reduce power consumption. In the FPGA prototype this makes little sense in itself, since the logic overhead and reconfigurability of the FPGA make the power consumption numbers hardly relevant. But it is extremely important that as much of the ASIC code as possible goes unchanged when the ASIC is prototyped in FPGA. Today's big FPGAs (I didn't look at small ones) support clock gating, and that is important for ASIC prototyping.

For low-volume products, where an ASIC is not commercially viable, it can be useful to clock gate parts of the FPGA to reduce power consumption. Less power means less cooling (for example, lower fan speed with active cooling), so the product requirements may call for clock gating even when the design is in an FPGA, which always consumes more than the same design implemented as an ASIC in the same technology node.




What is FPGA?

ANS: Field Programmable Gate Array is a semiconductor device containing programmable logic components called "logic blocks", and programmable interconnects.
Logic blocks can be programmed to perform the function of basic logic gates such as AND and XOR, or more complex combinational functions such as decoders or mathematical functions. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.

Q. What are DCMs? Why are they used?

ANS: A digital clock manager (DCM) is a fully digital control system that uses feedback to maintain clock signal characteristics with a high degree of precision despite normal variations in operating temperature and voltage. In other words, the clock output of a DCM is stable over a wide range of temperature and voltage, the skew associated with the DCM is minimal, and phase-shifted versions of the input clock can be obtained. The DCM output, driven through a global buffer, can handle a larger load.
Q. What are the differences between FPGA and CPLD?
ANS:
FPGA:
A) SRAM based technology.
B) Segmented connection between elements.
C) Usually used for complex logic circuits.
D) Must be reprogrammed once the power is off.
E) Costly
CPLD:
A) Flash or EPROM based technology.
B) Continuous connection between elements.
C) Usually used for simpler or moderately complex logic circuits.
D) Need not be reprogrammed once the power is off.
E) Cheaper

Q. What is DFT ?
ANS: DFT stands for Design for Test (or Design for Testability). It is a methodology that makes it possible to check that a design works properly after manufacturing, and that later facilitates failure analysis and the detection of faulty parts. In addition to the functional logic, you add DFT logic to your design, which helps in testing the chip for manufacturing defects after it comes back from the fab. Scan, MBIST, LBIST, IDDQ testing, etc. are all part of this. (This is a hot field with lots of opportunities.)

Q. What is Synthesis?

ANS: Synthesis is the stage in the design flow, which is concerned with translating your Verilog code into gates - and that's putting it very simply! First of all, the Verilog must be written in a particular way for the synthesis tool that you are using. Of course, a synthesis tool doesn't actually produce gates - it will output a netlist of the design that you have synthesised that represents the chip which can be fabricated through an ASIC or FPGA vendor.

Q. What is slice? What is CLB?

ANS: The Configurable Logic Blocks (CLBs) constitute the main logic resource for implementing synchronous as well as combinatorial circuits. Depending on the coding style, a CLB can be configured as combinational logic, RAM or ROM. In families such as Spartan-3, a CLB consists of 4 slices, and each slice contains two 4-input LUTs (look-up tables), the F-LUT and the G-LUT. When a LUT is used as memory, the write is a clocked assignment, reads from the memory are asynchronous, and the address lines are shared by the read and write operations.

Q. Can a CLB be configured as RAM?

ANS: Yes

Q. What is the purpose of DRC?
ANS: DRC (design rule check) is used to check whether a layout (especially the mask sets involved) conforms to a pre-defined rule set for the technology used in the design. The rules are parameters specified by the semiconductor manufacturer for how the masks should be placed, connected and routed, so that variations in the fab process do not affect normal functionality. A rule usually denotes the minimum allowable configuration (for example, a minimum width or spacing).

Q. Compare PLL & DLL?

PLL:
PLLs have disadvantages that make their use in high-speed designs problematic, particularly when both high performance and high reliability are required. The PLL voltage-controlled oscillator (VCO) is the greatest source of problems. Variations in temperature, supply voltage, and manufacturing process affect the stability and operating performance of PLLs.
DLL:
DLLs, however, are immune to these problems. A DLL in its simplest form inserts a variable delay line between the external clock and the internal clock. The clock tree distributes the clock to all registers and then back to the feedback pin of the DLL. The control circuit of the DLL adjusts the delay so that the rising edges of the feedback clock align with the input clock. Once the edges of the clocks are aligned, the DLL is locked, and both the input buffer delay and the clock skew are effectively reduced to zero.
Advantages of DLLs:
· Precision
· Stability
· Power management
· Noise sensitivity
· Jitter performance.

Q. What is LVS and why do we do it? What is the difference between LVS and DRC?
ANS: The layout must be drawn according to strict design rules. DRC (design rule check) helps in laying out the design by checking whether the layout abides by those rules. After the layout is complete, we extract the netlist from it. LVS (layout versus schematic) compares the netlist extracted from the layout with the schematic to ensure that the layout is an identical match to the cell schematic.

Q. Can you suggest some ways to increase clock frequency?
ANS:
· Check critical path and optimize it.
· Tighten the timing constraints (over-constrain) so the tools work harder on the critical paths.
· Pipeline the architecture as far as possible, keeping the latency requirements in mind.
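The effect of pipelining on clock frequency can be sketched numerically. The delay values below are made-up illustrative numbers, not data for any real device, and the model assumes the clock period only has to cover clock-to-Q delay, logic delay and setup time (clock skew is ignored).

```python
def fmax_mhz(t_clk_to_q_ns, t_logic_ns, t_setup_ns):
    """Maximum clock frequency (MHz) for one register-to-register path."""
    period_ns = t_clk_to_q_ns + t_logic_ns + t_setup_ns
    return 1000.0 / period_ns


# Hypothetical path: 0.5 ns clock-to-Q, 9 ns of logic, 0.5 ns setup.
unpipelined = fmax_mhz(0.5, 9.0, 0.5)

# Splitting the logic into two balanced stages halves the logic delay
# per stage, at the cost of one extra clock cycle of latency.
pipelined = fmax_mhz(0.5, 4.5, 0.5)

print(f"unpipelined: {unpipelined:.1f} MHz")
print(f"pipelined:   {pipelined:.1f} MHz")
```

The gain is less than 2x because the register overhead (clock-to-Q plus setup) is paid in every stage.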

Q. What is the significance of FPGAs in modern day electronics?
ANS:
         Shorter time to market compared with an ASIC
         Start-up companies are often better off implementing their projects on an FPGA rather than an ASIC
         Lower cost, and reconfigurable as many times as required


Q. FPGA design flow?



Q. Tell me some features of FPGA you are currently using?

1.        FPGA Spartan-3E:  (XC3S500E)
The Spartan-3E Starter Board provides a powerful, self-contained development platform for designs targeting the Spartan-3E FPGA from Xilinx.
It features a 500K-gate Spartan-3E FPGA with a 32-bit RISC processor (the MicroBlaze soft core) and DDR interfaces.
The board also features Xilinx Platform Flash, USB and JTAG parallel programming interfaces, with numerous FPGA configuration options via the onboard Intel StrataFlash and ST Microelectronics serial flash.
The board comes with a power supply and a USB cable for programming, so designs can be implemented immediately with no hidden costs.
The Spartan-3E Starter Board is also compatible with the MicroBlaze Embedded Development Kit (EDK) and PicoBlaze from Xilinx.

2.    FPGA Spartan-6: (XC6SLX16-CSG324)
The Nexys3 system development platform features Xilinx's newest Spartan-6 FPGA, 48 Mbytes of external memory (including two non-volatile phase-change memories from Micron), and enough I/O devices and ports to host a wide variety of digital systems.

The on-board high-speed USB2 port provides board power, FPGA programming, and user-data transfers at rates up to 38 Mbytes/sec.
A large collection of low-cost peripheral boards, including more than 30 Pmods and several new Vmods, can add features to the Nexys3, including A/D and D/A converters, breadboards, motor drivers, displays, etc. The Nexys3 is fully compatible with all Xilinx tools, including the free ChipScope, EDK (embedded processor design kit) and other tools.

Q. What is a LUT?
 Look-up Table:
            Look-up tables (LUTs) are used to implement function generators in CLBs.
            Four independent inputs are provided to each of two function generators (F1-F4 and G1-G4).
            These function generators can implement any arbitrarily defined Boolean function of four inputs.
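To see why a 4-input LUT can implement any Boolean function of four inputs, it helps to model it as a 16-entry truth table addressed by the inputs. The sketch below is a behavioural illustration only; the function names are hypothetical and the bit ordering (input `a` as the least significant address bit) is an assumption of this example.

```python
def make_lut4(func):
    """Build the 16-entry truth table of a 4-input Boolean function."""
    table = []
    for address in range(16):
        # Split the 4-bit address into individual input bits (a is the LSB).
        a, b, c, d = [(address >> i) & 1 for i in range(4)]
        table.append(func(a, b, c, d) & 1)
    return table


def lut4_read(table, a, b, c, d):
    """Evaluate the LUT: the inputs simply form the address into the table."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


# Example: a 4-input XOR (parity) function stored in one LUT.
parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(lut4_read(parity, 1, 0, 1, 1))
```

Changing the function only changes the 16 stored bits, not the structure, which is what makes the LUT a generic function generator.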

Q. How to generate clocks on FPGA?
         You need an external clock source to drive the FPGA in any case.
         Inside the FPGA you can use a PLL to generate specific frequencies.
         You can also use counters to scale the clock down.
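Scaling a clock down with a counter can be sketched as follows. This is a behavioural model for illustration, with a hypothetical function name, and it only handles even division ratios.

```python
def divide_clock(num_input_cycles, divide_by):
    """Return the divided clock level after each input clock rising edge.

    The output toggles every divide_by/2 input cycles, so divide_by
    must be an even number in this simple model.
    """
    assert divide_by >= 2 and divide_by % 2 == 0
    half = divide_by // 2
    count = 0
    level = 0
    out = []
    for _ in range(num_input_cycles):
        count += 1
        if count == half:
            count = 0
            level ^= 1  # toggle the divided clock
        out.append(level)
    return out


print(divide_clock(8, 4))
```

In a real design it is common practice to use such a counter to produce a clock enable for logic that stays on the original clock, rather than routing the counter output as a new clock.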




Q. What are the FPGA Editor file formats?
          FPGA Editor reads and writes the following file types:

File Type | Input / Output | Definition
NCD | Input/Output | MAP or PAR generates the .ncd file. FPGA Editor uses this file with the New or Open command from the File menu. You can edit the NCD file in the FPGA Editor.
PCF | Input/Output | A .pcf file is an ASCII file containing physical constraints created by the MAP program as well as physical constraints entered by you. You can edit the PCF file in the FPGA Editor.
NMC | Input/Output | An .nmc file contains a physical hard macro which can be created or viewed with the FPGA Editor.
ELF | Input | An .elf file (pronounced “elf”) is a binary data file that contains an executable CPU code image, ready for running on a CPU.
DRF | Input | A .drf file (pronounced “dwarf”) is a binary data file that also contains the executable CPU code image, plus debug information required by symbolic source-level debuggers.
MEM | Input | A .mem file (memory) is a simple text file that describes contiguous blocks of data.
BIT | Output | A .bit file contains location information for logic on a device, including the placement of CLBs, IOBs, TBUFs, pins, and routing elements. The bitstream also includes empty placeholders that are filled with the logical states sent by the device during a readback. Only the memory elements, such as flip-flops, RAMs, and CLB outputs, are mapped to these placeholders, because their contents are likely to change from one state to another. When downloaded to a device, a bitstream configures the logic of a device and programs the device so that the states of that device can be read back.
CDC | Output | A .cdc file can be generated from the ILA command.



Q. How do you implement DCM?
The manual of my Spartan-3E FPGA kit says: "Alternatively, use the FPGA’s Digital Clock Manager (DCM) to generate or synthesize other frequencies from the on-board 50 MHz oscillator."

Q. Why is the map-timing option used?
To improve design performance, timing, and packing for highly utilized designs.


Q. What are different types of timing verifications?

1.      Static timing:
a. The delays over all paths are added up.
b. All possibilities, including false paths, are verified without the need for test vectors.
c. Much faster than simulations, hours as opposed to days.
d. Not good with asynchronous interfaces or interfaces between different timing domains.

2.      Dynamic timing:
a. The design is simulated in full timing mode.
b. Not all possibilities are tested, as it is dependent on the input test vectors.
c. Simulations in full timing mode are slow and require a lot of memory.
d. Best method to check asynchronous interfaces or interfaces between different timing domains.
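The "delays over all paths are added up" step of static timing can be illustrated with a toy calculation. The path names, delay values and the 6.0 ns clock period below are invented for this example; a real static timing tool works on the full netlist and also accounts for setup time, clock skew and more.

```python
def worst_path_delay(paths):
    """Return (delay_ns, name) of the slowest path.

    paths maps a path name to the list of element delays (ns) along it.
    """
    totals = {name: sum(delays) for name, delays in paths.items()}
    worst = max(totals, key=totals.get)
    return totals[worst], worst


# Hypothetical register-to-register paths and their element delays in ns.
paths = {
    "regA->regB": [0.5, 1.2, 0.8],
    "regA->regC": [0.5, 2.1, 1.9, 0.7],
}
delay, name = worst_path_delay(paths)
clock_period_ns = 6.0            # assumed timing constraint
slack = clock_period_ns - delay  # positive slack means the path meets timing
print(name, round(delay, 2), round(slack, 2))
```

No test vectors are involved: every listed path is checked, which is why the static method covers all possibilities.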

Name a few types of flip-flops
JK, D, T. Further Reading: What is a Flip-Flop
Name a few types of latches
SR, JK. Further Reading: What is a Latch
Describe the difference between a flip-flop and a latch
The main difference is that a flip-flop is edge-triggered by a clock, while a latch is level-sensitive and has no clock edge. The clock input on the flip-flop is used to pass the D input to the Q output on the active clock edge only. A latch is transparent while its enable is active, so the output follows the input, and it latches or holds the output steady once the enable goes inactive. In general, flip-flops are used commonly in FPGA designs and latches should be avoided.
Why might you choose to use an FPGA in your design?
FPGAs are highly customizable. Some reasons to use an FPGA in your design might be: many unique peripherals required, lots of input/output needed, very fast processing speed required, lots of math operations, high data throughputs, interfaces to high-bandwidth external memory, and reprogrammability.
What does a for loop do in synthesizable code? Does it work the same way as in a software language like C?
A for loop in VHDL and Verilog is not the same thing as a for loop in Python or C. In synthesizable code, for loops are used to replicate logic. A loop saves having to type the same thing over and over again, but it does not produce a loop in the way that software loops work: the tools unroll it into parallel hardware. In general, beginners should avoid for loops in synthesizable code unless they clearly understand how they work. Further Reading: For Loops in VHDL or Verilog.
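One way to picture the replication is to print what an unrolled loop amounts to. The sketch below is only an analogy written in Python: `unroll_parity` is a hypothetical helper that builds the text of the flat XOR expression that a loop over the bits of a signal called `data` (also a made-up name) would describe.

```python
def unroll_parity(width):
    """Return the flat expression equivalent to looping over all bits."""
    # One XOR term per loop iteration; nothing iterates at run time.
    return " ^ ".join(f"data[{i}]" for i in range(width))


print(unroll_parity(4))
```

The loop bound has to be known when the design is compiled, because each iteration becomes a piece of hardware that exists all the time.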


Describe the differences between SRAM and DRAM
SRAM means Static Random Access Memory; DRAM stands for Dynamic Random Access Memory. Both SRAM and DRAM are volatile, meaning that the information written to them is lost when power is removed. However, DRAM also needs to be "refreshed" in order to maintain the values written to it, whereas SRAM, being static, does not. The refreshing is usually handled by the memory controller. As far as use cases go, SRAM is better when short non-sequential reads and writes to memory are required, and DRAM is better when there are large burst reads/writes of sequential memory. For example, if you're reading 16 bits at a time from any random address in memory on each read, this is likely a better application for SRAM. But if you're trying to store an entire image from a camera to memory, this is definitely a better application for DRAM.
What is the purpose of a PLL?
PLL stands for phase-locked loop. PLLs are commonly used inside FPGAs to generate desired clock frequencies. They are dedicated blocks built into the FPGA that take an input clock and derive from it clocks with a different frequency or phase. They are very useful if your design requires several unique clocks to be running internally.
Describe the difference between inference and instantiation
Inference is when you write VHDL or Verilog to "infer" or tell the synthesis tools to place some type of component down. For example, by creating a large memory storage register, you might be inferring a Block RAM. Instantiation is when you directly create the primitive component for the Block RAM based on the particular vendor's user guide for how to instantiate primitive components. Inference is more portable across FPGA technologies. Instantiation might be better if you need to be very explicit about the primitive that you want to work with, or apply some unique settings to it.
What is metastability, how would you prevent it?
Metastability means that a signal is in an unpredictable or unknown state. It occurs most often when bringing a signal from outside the FPGA, or from another clock domain, into the internal logic. In general it should be avoided. There are several methods to reduce the risk of metastability inside your FPGA; a common suggestion is to "double-flop" all inputs into the local clock domain.
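The "double-flop" structure can be sketched as a cycle-by-cycle model. Like any digital simulation, this sketch does not model metastability itself; it only shows the structure and the extra latency it adds. The class name is hypothetical.

```python
class TwoFlopSynchronizer:
    """Behavioural model of a two-flip-flop synchronizer."""

    def __init__(self):
        self.ff1 = 0  # first flop: may go metastable in real hardware
        self.ff2 = 0  # second flop: the output used by the local logic

    def clock(self, async_in):
        """Advance one destination-clock edge and return the output."""
        # Both flops update on the same edge, so ff2 takes the OLD ff1 value.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2


sync = TwoFlopSynchronizer()
outputs = [sync.clock(x) for x in [0, 1, 1, 1, 0, 0]]
print(outputs)
```

The second flop gives the first one a full clock period to settle before its value is used, at the cost of a change appearing at the output one clock edge later than it would with a single flop.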

What is a FIFO?
FIFO stands for First In First Out. It is a commonly used FPGA component. A FIFO is a storage element, usually made of a Block RAM for large FIFOs and registers for short FIFOs. They are used to buffer data, for example when reading and writing to external memory, or for when crossing clock domains, or for storing pixels coming out of a camera. Further Reading: What is a FIFO
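A minimal software model of a FIFO with full and empty flags, for illustration only. The class and method names are hypothetical, and a real FPGA FIFO would be built from Block RAM or registers with read and write pointers implemented in logic.

```python
class Fifo:
    """Fixed-depth first-in first-out buffer with full/empty flags."""

    def __init__(self, depth):
        self.mem = [None] * depth
        self.depth = depth
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.count = 0

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def write(self, value):
        """Store a value; return False and store nothing if the FIFO is full."""
        if self.full():
            return False
        self.mem[self.wr_ptr] = value
        self.wr_ptr = (self.wr_ptr + 1) % self.depth  # wrap around
        self.count += 1
        return True

    def read(self):
        """Return the oldest value, or None if the FIFO is empty."""
        if self.empty():
            return None
        value = self.mem[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % self.depth  # wrap around
        self.count -= 1
        return value


fifo = Fifo(depth=4)
for pixel in (10, 20, 30):
    fifo.write(pixel)
print(fifo.read(), fifo.read())
```

Values come out in the order they went in, and the full and empty flags are what the surrounding logic checks before writing or reading.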
What is a Block RAM?
A Block RAM is a dedicated part of an FPGA that is usually a 16k-bit or 32k-bit storage element. Its width and depth are configurable, and it is useful for many applications inside an FPGA. Block RAMs are used in dual-port memories, FIFOs, and large look-up tables, to name a few. More Info: Block RAM in FPGA.
Describe how a UART works and where it might be used?
UART stands for Universal Asynchronous Receiver Transmitter. It is asynchronous, meaning no clock signal is sent along with the data. It has the ability to both receive and transmit data serially. It is a low-bandwidth, low-pin-count, very simple interface that is commonly used to communicate between FPGAs, or from an FPGA to a microcontroller, or from an FPGA to a computer. Since there is no shared clock, the receiver and the transmitter need to agree on a baud rate and other parameters prior to communication. Further Reading: What is a UART?.
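As an illustration of the "baud rate and other parameters" both sides must agree on, the sketch below assumes the common 8N1 format (one start bit, eight data bits sent LSB first, no parity, one stop bit) and a 50 MHz system clock. Both are assumptions of this example, and the function names are hypothetical.

```python
def uart_frame(byte):
    """Return the line levels for one 8N1 frame.

    Start bit (0), eight data bits LSB first, stop bit (1).
    """
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def clocks_per_bit(clock_hz, baud):
    """Number of system clock cycles each bit is held on the line."""
    return round(clock_hz / baud)


print(uart_frame(0x41))                    # the ASCII character 'A'
print(clocks_per_bit(50_000_000, 115200))  # divider for 115200 baud
```

The receiver uses the same divider value to sample each bit near its middle, which is how the two ends stay aligned without a shared clock line.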
What is the difference between synchronous and asynchronous logic?
Synchronous means there is a clock involved, whereas asynchronous means there is no clock. Most logic inside of an FPGA is synchronous, like flip-flops, Block RAMs, etc., while things like resets can be asynchronous. Separately, there can be both synchronous and asynchronous communication protocols. Many communication interfaces are synchronous and send a clock along with the data (e.g. SPI, I2C), but a UART is asynchronous.
What is a shift register in an FPGA?
A shift register is a method of moving data from a source to a destination in some number of clock cycles. It is useful for creating delays inside of your FPGA, or for converting serial data to parallel data, or for converting parallel data to serial data. Further Reading: What is a Shift Register.
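Serial-to-parallel conversion with a shift register can be sketched like this. It is a behavioural model with a hypothetical function name, and it assumes the most significant bit arrives first.

```python
def serial_to_parallel(bits, width):
    """Shift bits in one per clock (MSB first).

    Returns the register value after each clock edge.
    """
    reg = 0
    mask = (1 << width) - 1
    history = []
    for b in bits:
        # Shift left by one and bring the new bit in at the bottom.
        reg = ((reg << 1) | b) & mask
        history.append(reg)
    return history


print(serial_to_parallel([1, 0, 1, 1], 4))
```

After `width` clock cycles the register holds the complete parallel word, which also shows why a shift register delays data by a known number of clocks.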
Describe some differences between VHDL and Verilog?
VHDL is strongly typed, commonly used for high-reliability code (e.g. military and medical devices). It is very verbose, meaning a lot of typing is required. Verilog is closer to C, is less strongly typed, and is generally more forgiving. Further Reading: VHDL vs. Verilog, which language should you learn first?.
What should you be concerned about when crossing clock domains in an FPGA?
Whenever crossing clock domains you should be concerned about creating a metastable condition. In general, it's a good idea to use a primitive that is capable of crossing clock domains, such as a Block RAM, unless you're careful with how you create your register logic and create timing constraints that tell the tools what you're doing. Additionally, you want to make sure that your data storage element is deep enough to cross between the clock domains without losing data. Data can be lost if you write to a full FIFO, and invalid data is read if you read from an empty FIFO, which happens when one clock domain is expecting data and there is none. Further Reading: Crossing Clock Domains in an FPGA.
Describe setup and hold time, what happens if they are violated?
Setup is the amount of time required for the input of a Flip-Flop to be stable before the clock edge comes along. Hold time is the amount of time required for the input of a Flip-Flop to be stable after the clock edge comes along. If setup or hold time is violated, then there can be a metastable condition inside of your FPGA. This will likely create a timing error inside the place and route tool. You should never have timing errors inside of an FPGA design. Further Reading: Setup and Hold Time in an FPGA.
What's the difference between a Mealy and a Moore machine?
Both Mealy and Moore machines are types of state machines that can exist inside your FPGA. In a Mealy machine, the outputs depend on both the current state and the current inputs. In a Moore machine, the outputs depend only on the current state. In both types, the next state is determined by the current state and the inputs. Personally, I find Moore machines simpler to code and understand, so I tend to use those.
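The usual textbook distinction is about outputs: a Mealy output is a function of the current state and the current input, while a Moore output is a function of the current state only. The sketch below shows a rising-edge detector written both ways; the function names and state names are hypothetical.

```python
def mealy_rising_edge(inputs):
    """Mealy: output depends on the state (previous input) AND the input."""
    prev = 0
    out = []
    for x in inputs:
        out.append(1 if (prev == 0 and x == 1) else 0)
        prev = x
    return out


def moore_rising_edge(inputs):
    """Moore: output depends only on the current state."""
    state = "LOW"  # states: LOW, EDGE, HIGH
    out = []
    for x in inputs:
        out.append(1 if state == "EDGE" else 0)  # output from state alone
        if state == "LOW":
            state = "EDGE" if x else "LOW"
        else:  # EDGE or HIGH
            state = "HIGH" if x else "LOW"
    return out


stimulus = [0, 1, 1, 0, 1]
print(mealy_rising_edge(stimulus))
print(moore_rising_edge(stimulus))
```

The Moore version needs an extra state and reports the edge one cycle later, but its output never changes in the middle of a clock cycle because it does not depend directly on the input.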
What is the purpose of the synthesis tools?
The synthesis tools are provided by the FPGA vendor and are used to translate your VHDL or Verilog code into logic that the FPGA is built from (e.g. Flip-Flops, Look-Up Tables, Block RAMs, etc).
What happens during Place and Route?
The synthesis process is usually followed by place and route, which takes the primitives, places them inside the FPGA, routes the connections between them, and checks that the entire design meets your timing constraints. The constraints tell the tools the clock rates and the specific I/O to use, and the place and route process ensures that your design is able to work at those speeds.
What is a SERDES transceiver and where are they used?
SERDES stands for SERializer/DESerializer. These are high-speed transmitters and receivers that are used to send serial data across a point to point link. Usually these are used at speeds of > 1 Gbps. Current FPGA technology can have SERDES transceivers that operate at > 50 Gbps for a single data link. Data integrity at these speeds is very challenging, so lots of tricks are employed to ensure data is able to pass successfully. They are used commonly in RF applications, high-speed video applications, communication interfaces such as PCI Express, SATA, Gigabit Ethernet, etc.
What is the purpose of a DSP tile in an FPGA?
DSP stands for Digital Signal Processing. A DSP tile is a dedicated piece of hardware inside the FPGA that is very good at performing fast multiplication and addition operations. Normal FPGA logic is able to perform multiplies, but not at fast data rates of, say, 200 MHz or faster. This is where dedicated DSP tiles are used. They are common in filter designs or image processing pipelines that require many fast multiplication operations to be performed on input data.
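The multiply-and-add pattern that DSP tiles accelerate can be seen in a direct-form FIR filter, where each output sample is a sum of multiply-accumulate (MAC) operations. This is a plain software sketch with a hypothetical function name, not a model of any particular DSP tile.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR filter built from multiply-accumulate steps."""
    history = [0] * len(coeffs)  # most recent sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample in
        acc = 0
        for c, h in zip(coeffs, history):
            acc += c * h  # one multiply-accumulate per filter tap
        out.append(acc)
    return out


# Two taps of 1 give a running sum of the last two samples.
print(fir_mac([1, 2, 3, 4], [1, 1]))
```

In hardware the inner loop can be spread over several DSP tiles working in parallel, or run serially through one tile at a higher clock rate.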
What projects have you done? What technologies have you worked with?
Be prepared to talk in detail about your own experience using FPGAs! Have an interesting project to discuss. Bonus points if you put a link to a GitHub repository on your resume and show your source code.


FPGA APPLICATIONS


Ideal platform for prototyping
Providing fast implementation to reduce time-to-market
Cost effective solutions for products with small volumes on demand
Implementing hardware systems requiring re-programming flexibility
Implementing dynamically re-configurable systems

CLOCK DOMAIN CROSSING


  • Different clocks within a single design

  • Many possible reasons, here are a few:
         Complying with multiple protocols
         SoC Integration:  IP from many sources
         Saving area/power, limiting fast clock use

  • Dangers when signals cross domains
         Incorrect functionality if not synced
         Not properly covered by validation
  • Each flop has a setup and hold time
         Setup time:  Time before clk edge
         Hold time:  Time after clk edge


CDC problems


CDC problems corrupt control and data signals; are difficult to reproduce and debug; are sensitive to temperature, voltage, and process; and will only occur in hardware, often in the final design. Traditional verification techniques do not work for CDC signals.



When 2 or more designs run on disparate clocks:

   The clocks will continually skew, guaranteeing setup/hold violations
   Signals from one design to another are “Clock Domain Crossings” (CDCs)
    Signals that cross asynchronous clock domains (CDC signals) WILL violate setup and hold conditions

Problem:

    Signals crossing a clock domain will violate set-up/hold
    Impact: Control/data signals will be dropped/corrupted
       Loss of Data
Approaches:
Avoid having systems that have multiple clocks
       Although sensible, it’s becoming impossible
  Design around the problem
       Designer can add “synchronizers” to the design
       Metastability still happens, but nobody else sees it
    E.g. 2DFF, FIFO, etc.


“Fences in” metastability

       Designers add synchronizers to reduce the probability of metastable signals
       Synchronizers are sub-circuits that can prevent metastable values from being sampled across clock domains
    Take unpredictable metastable signals and create predictable behavior

       CDC signals cross with an assumed relationship
       Can be combinational, sequential, or deeply sequential
       Unpredictable delays on CDC paths lead to reconvergence errors
    Designs need logic to correctly handle reconvergence
    Can occur on single-bit or multiple-bit signals
       Synchronization between clock domains requires a transfer protocol
    Ensures data is predictably transferred between domains
    These protocols must be verified
       When protocol is violated
    Data is lost
    Simulation may not show a failure
    Silicon will eventually show a functional error

During design planning

  1. Create systems/designs using 1 clk, 1 edge when possible
  2. If multiple clocks are required, try to use 1 designer for both clock domains, and use coding guidelines
    1. Use signal naming conventions
    2. Many clock domain errors come from design changes, not the initial design

       Problem:
    Missing synchronizers will create metastability
    Correctly placed but misused synchronizers won’t work
    Reconvergence of synchronized signals can create unexpected behavior
       Approaches:
    Simulation
       Digital logic simulators do NOT model transistor behavior
       Do not model “metastability”



Asynchronous Clock Domain Crossings Issues


Data expected ready by setup time, and stable during hold time
Source clk not coordinated with dest clk
Chance of arriving in setup/hold window
Possibility of ‘metastable’ (halfway) state
Value may settle either way
Multiple fanouts may detect different values
CDC often done with FV tools
         Mentor 0in, Cadence Conformal, RealIntent
         In other direction, Lint vendors expanding to CDC:  Synopsys LEDA, Atrenta Spyglass
  • Some CDC checks may add assertions
For checking in simulation or FV
  • Simulation: inject random metastability
  • Useful, but hard to get right
         Sometimes delay a value change
         Allow 1-cycle value to be missed
         Depends on # of random tests




FPGAs have rapid average metastability recovery (picoseconds)
 But recovery is unbounded (there is a nonzero probability that the metastable state will last for any given time T)
 Design goal: achieve a desired MTBF, given the recovery parameters of the flip-flop
Synchronizer: Multiple FFs at the receiving end guard against metastability
 Number of FFs is a function of the desired MTBF and the ratio of clock frequencies
Register the output from the source domain for more predictable timing (increased MTBF)
2FF synchronizers work well when the destination clock is faster than the source clock (the signal will remain stable for at least one destination clock cycle)

 Fast-to-slow crossing requires a closed-loop synchronizer with handshake

Closed-loop: Control signal crosses into clock domain 2, then back into clock domain 1, and is checked against the reference (high latency)
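The MTBF design goal mentioned above is usually estimated with the exponential model MTBF = exp(t_r / tau) / (T0 * f_clk * f_data), where t_r is the time allowed for the flop to resolve, tau and T0 are recovery parameters of the flip-flop, f_clk is the sampling clock frequency and f_data is the rate at which the asynchronous input changes. The sketch below evaluates that formula; every numeric value in it is a made-up placeholder, not a characterised parameter of any device.

```python
import math


def synchronizer_mtbf_seconds(t_resolve_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data)."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)


# Hypothetical flip-flop parameters and clock rates (assumptions).
tau = 50e-12       # resolution time constant
t0 = 100e-12       # metastability window parameter
f_clk = 100e6      # destination clock frequency
f_data = 10e6      # rate of change of the asynchronous input

one_stage = synchronizer_mtbf_seconds(2e-9, tau, t0, f_clk, f_data)
# A second flop adds roughly one more clock period of resolving time.
two_stage = synchronizer_mtbf_seconds(2e-9 + 1 / f_clk, tau, t0, f_clk, f_data)
print(f"{one_stage:.3e} s versus {two_stage:.3e} s")
```

Because the resolving time sits in the exponent, each additional synchronizer stage multiplies the MTBF by a very large factor, which is why the number of flops is chosen from the desired MTBF.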


Single clock domain

Multiple clocks domain

         Synchronizers 

Circuits that condition CDC signals in order to reduce the probability of metastability.


Types of Synchronizers 

         Basic Synchronizer  (Level Signal)

Require synchronizer:  structure to reduce risk of metastability
Simplest synchronizer:  two flops
           Metastability very unlikely:  second flop will choose a value if first is metastable
One-cycle uncertainty in transition

         Edge-detecting Synchronizer 

         Pulse Synchronizer

Clock frequency may be configurable or unknown
Hard to design the previous synchronizers for all cases
Hard to properly check for all cases
May sometimes be overconservative
Solution:  handshake protocols
Sender stores data in FIFO
Explicit request/response for data
More logic & harder to check, but powerful 

Basic CDC Checks
  • Structural:  Identify asynchronous CDCs
         Note: synchronous signals are OK, little metastability risk if properly timed
         But same-frequency != synchronous
         If no synchronizer, report error
  • Functional:  Assert signal held long enough
         Due to uncertainty of capture time
         Hold for 2+ receiver cycles: create assert
         May need longer depending on relative freq
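The "hold for 2+ receiver cycles" rule can be turned into a number of source-clock cycles once both frequencies are known. The helper below is a hypothetical illustration of that arithmetic; the default of two destination cycles is taken from the rule above, and real designs may need more margin.

```python
def min_hold_source_cycles(f_src_hz, f_dst_hz, dst_cycles=2):
    """Source-clock cycles needed to span dst_cycles destination periods.

    Frequencies are integers in Hz.
    """
    # Ceiling division on integers avoids floating-point rounding.
    return -(-dst_cycles * f_src_hz // f_dst_hz)


# Fast source (200 MHz) into slow destination (50 MHz).
print(min_hold_source_cycles(200_000_000, 50_000_000))
# Slow source (50 MHz) into fast destination (200 MHz).
print(min_hold_source_cycles(50_000_000, 200_000_000))
```

The faster the source clock is relative to the receiver, the more source cycles a signal has to be stretched over, which is the "relative freq" dependence noted above.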


CDC Protocols

    In many applications, simple signals aren’t the only information crossing clock domains. Data, address and control buses also travel together across domains.

CDC Verification
  1. Assertion Based Verification.
  2. Netlist Analysis.
  3. User Specification of Design Intent.
  4. Verifying CDC Monitors Using Simulation.
  5. Verifying CDC Assertions Using Formal Verification.



http://www.national.com/appinfo/power/xilinxfpga.html
https://www.seeedstudio.com/Spartan-Edge-Accelerator-Board-p-4261.html
http://corevlsi.blogspot.com/2014/11/fpga-interview-questions_4.html
http://www.design4silicon.com/2016/05/fpga-interview-questions-and-answers.html

FPGA INTERVIEW QUESTIONS


FPGA  architecture

VHDL /verilog code.

 Interfaces (such as PCIE, DDR3..)

Embedded ARM

What do you mean by JTAG OR USB ?

Write an RTL Block that implements 4x4 matrix multiplication. Then they asked me to optimize the critical path. Then check if we can reduce the chip area using less number of Multipliers.

How do you optimize your ASIC/FPGA design

What is speed grade and how do you select FPGA as per requirements?

What is the maximum possible speed achievable for a given device?

How do you code to reduce power in FPGA design

Answers

Avoid reset for FPGA

Clock Gate

Use synchronous design

Avoid over constraining

Reduce Device temperature (cooling solution)

Use clk_en and control enable for all Memory

Use LUT for smaller memory. BRAM takes more power

What is clock gating and how do you do it in FPGA

What is Clock domain crossing?

Why 2 blocks may need to work on different clocks?

What problems may arise due to clock domain crossing of signals?

How to resolve issues arising because of  clock domain crossing ?

How do you manage multiple clocks and how do you route them?

CDC tools such as Spyglass can help with this. But asynchronous transfers must be handled carefully in the design; later they can be declared as false paths so that the tool does not try to time them during compilation.

How do you do IO planning and can you explain few types of IO types?

How do you add debug probes for ChipScope, and what are the benefits of inserting it versus inferring it?

Answers: Inferring it along with your RTL gives you the freedom to debug whenever required. Adding it as a probe afterwards requires you to search through the netlist file and rack your brain finding the net names. Sometimes some logic needs to be permanently available on JTAG for status/debug.

How do you program multiple FPGA images in flash and load?

How do you control reset logic?

For FPGA, a global reset is sufficient. Use an asynchronous reset for internal logic and synchronize that reset to the main clock (if all the logic is in the same clock domain; otherwise synchronize it in each domain accordingly). Always assert the reset asynchronously and de-assert it synchronously with the clock.
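"Assert asynchronously, de-assert synchronously" is commonly built as a two-flop reset synchronizer. The behavioural sketch below uses hypothetical names and shows only the sequencing, not real timing.

```python
class ResetSynchronizer:
    """Two-flop reset synchronizer: async assert, sync de-assert."""

    def __init__(self):
        self.ff1 = 0
        self.ff2 = 0  # 0 means the synchronized reset is active

    def async_reset(self):
        """Asserting reset clears both flops at once, with no clock edge."""
        self.ff1 = 0
        self.ff2 = 0

    def clock(self, reset_asserted):
        """One clock edge; returns True while the internal reset is active."""
        if reset_asserted:
            self.async_reset()
        else:
            # A constant 1 shifts through the two flops on clock edges.
            self.ff2, self.ff1 = self.ff1, 1
        return self.ff2 == 0


rst = ResetSynchronizer()
rst.async_reset()
print([rst.clock(False) for _ in range(3)])
```

The internal reset goes active immediately when the external reset is asserted, but it is released only on a clock edge, two edges after the external reset goes away, so all flops in the domain leave reset in the same cycle.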

How do you do STA?

What are the timing constraints that you do for a typical design having synchronous and asynchronous logic?

How does the RTL logic convert to logic gate, say a comparator or counter, please describe

Can you write a code in RTL for a debounce logic or 9-bit counter?

RTL coding for FPGA primitive components, what are the primitive components and what have you used.

What are the following? BUFGMUX, IBUF, OBUF

How can we find the max clock for a given FPGA program?

What is Metastability and what are its effects?

What is setup and hold time?

What is pipelining?

Why do FPGAs have dedicated clock pins? What's so special about them?

Describe a design that you think that will best fit an FPGA.

Describe a design that you think that will best fit a Microcontroller.

What are the difficulties in asynchronous FIFO design?

Differences between FPGA and ASICs, clock and reset schemes

A signal derived in clock domain A is used in a sequential process in clock domain B. What can potentially happen if you use the signal directly in clock domain B? What is the proper way to handle the signal before using it in clock domain B?

Why can't we use a clock source directly rather than using FF divider circuits, which also need a clock source for their operation?

Internal workings of FPGA's such as block ram, dsp, logic cell, LUT

What is the most basic way of meeting or checking to see if you are meeting timing?

Tell me some of constraints you used and their purpose during your design?

What are the different types of FPGA programming modes? Which are you currently using? How do you change from one to another?

Can you list out some of synthesizable and non synthesizable constructs?

Can you draw general structure of fpga?

Difference between FPGA and CPLD?

MMCM Vs. DCM

What is slice,clb,lut?

What is FPGA you are currently using and some of main reasons for choosing it?

Draw a rough diagram of how clock is routed through out FPGA?

How many global buffers are there in your current FPGA, what is their significance?

What is the frequency of operation and equivalent gate count of your project?

Compare PLL & DLL.

Suggest some ways to increase clock frequency?

What is LVS and why do we do it? What is the difference between LVS and DRC?

Clock management blocks in FPGAs

What are the transceiver types and speeds in Xilinx FPGAs?

Why are there hard macros in FPGAs? What are the downsides of including them?

Why is FPGA performance not usually as good as an ASIC or mask programmed gate array? (may not even be true anymore)

What are some of the basic differences between FPGA families?



PROTOTYPING OF FPGA



·          Why do we need FPGA prototyping, when IP/SoC verification already exists ?

First of all, we need to understand that FPGA prototyping is not an alternative to IP/SoC verification. IP/SoC verification has its own importance in the silicon design flow. As an example, take a case where we need to verify an IP such as an RTC (Real Time Clock). Simulating a single minute of RTC operation in verification takes more than an hour, so attempting to simulate one hour of RTC operation would be a nightmare. Another example is an IP with a large register set: millions of combinations can exist, but it is not possible to cover all of them in verification, and even an overnight regression can cover only a few hundred test cases.
Similarly, there are many other cases where regression time plays a critical role and FPGA prototyping can help complete testing of all combinations. The FPGA system clock is typically in the MHz range, depending on design size, which makes it much faster than simulation-based verification.

·                     Where we can use FPGA Prototyping ?

The previous answer gave one example, where FPGA prototyping is used to validate an IP and gain confidence in it. Using FPGA prototyping to validate an IP before silicon arrives is called Pre-Silicon Validation.
In addition, today's SoCs run huge applications. Previously, software development teams had to wait for silicon to arrive before they could start developing code on it. With FPGA prototyping the software team does not need to wait for silicon; it can complete all driver and application code well before the silicon arrives.
So FPGA prototyping is very useful for Pre-Silicon Validation as well as for software development well before the actual hardware is available.

·                     What are the inputs required for FPGA Prototyping ?

Only one input is required for FPGA prototyping: a synthesizable RTL design. That means the RTL design without the elements that cannot be synthesized, such as analog IPs, pads and delay elements.

·                     What is the flow for FPGA Prototyping ?

Similar to the ASIC design flow, there is an FPGA design flow. Most of this flow is carried out by the tools, with no manual work needed, which makes it easy to create and use.

·                     What are its limitations ?

FPGA prototyping has its advantages and limitations:
1.    Design Limitation:  It works only on synthesizable digital designs.
2.    Speed Limitation:  It is about a thousand times faster than verification, but generally 10-100 times slower than an ASIC.
3.    Debug Limitation:  In verification, the user can take a complete waveform dump and debug the design. On an FPGA, tools are available that let the user configure a 'Trigger' and capture a waveform dump, but only for a limited number of signals because of the FPGA's memory constraints.
4.    Coverage Limitation: On an FPGA the user can calculate functional coverage, but other metrics such as code coverage cannot be calculated.



FPGA Synthesis workaround

I placed a timing constraint on a path, but the constraint has errors and misses the goal by 10% to 15%. What can I do to make this timing constraint pass?

Solution

If your constraint misses its timing requirement by less than 15%, and the logic delay is less than 60% of the total path delay, try several cost tables at effort level 5, using MPPR (Multi-Pass Place and Route).

I placed a timing constraint on a path, but the constraint has errors and misses the goal by 5% to 10%. What can I do to make this timing constraint pass?

Solution

If your constraint misses its timing requirement by 5% to 10%, and the logic delay is less than 60% of the total path delay, you can try selecting a higher placement effort (4 or 5) and constraining the I/Os, especially data buses.

I placed a timing constraint on a path, but the constraint has errors and too many levels of logic. How can I make this timing constraint pass?

Solution

This is a case where logic exceeds some percentage of the total path delay, implying that there is too much logic between timing end points; the amount of logic must be reduced in order to meet timing requirements.
This number was traditionally around 50% for older architectures; it would need to be quantified for Virtex families (60%). There are exceptions to this rule for carry chain paths, in which the logic delays are much smaller and would allow for a higher number of logic levels or a lower component percentage.
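The logic-percentage check described above is simple arithmetic. The following is a minimal sketch, assuming the logic and routing delays of a path have been read from a timing report; the function names and the sample delays are hypothetical, and the 60% default is the Virtex figure quoted in the text.

```python
def logic_fraction(logic_delay_ns, route_delay_ns):
    """Share of the total path delay that is spent in logic."""
    return logic_delay_ns / (logic_delay_ns + route_delay_ns)


def has_too_much_logic(logic_delay_ns, route_delay_ns, threshold=0.60):
    """True when logic delay exceeds the given share of the path delay."""
    return logic_fraction(logic_delay_ns, route_delay_ns) > threshold


# Hypothetical paths: (logic delay, routing delay) in nanoseconds.
deep_path = has_too_much_logic(7.0, 3.0)      # 70% logic
routing_bound = has_too_much_logic(4.0, 6.0)  # 40% logic
```

A path flagged by this check needs fewer logic levels; a path that fails timing without being flagged is more likely a placement or routing problem.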

To reduce the levels of logic, return to the source and try the following:
1.      Apply the state machine optimization suggestions.
2.      Use CASE statements instead of nested IF-ELSE statements.
3.      Use 3-state instead of large muxes (7 or more inputs).
4.      Use creative math; shift instead of multiplying by powers of two.
5.      Use decoders instead of comparators.
6.      Balance logic around registers.
7.      Pyramid logic with parentheses instead of serial implementation.
8.      Use IF-THEN-ELSE statements only to do the following:
a.       Pre-decode and register counter values.
b.      Add a level of pipelining to pre-decode and register input signals.
9.      Use muxes with more than 7-bit wide buses only to do the following:
a.       Instead of logic, use registers that are in a 3-state condition.
b.      Drive enable signals from registers; 3-states are in a 3-state condition when the enable signal is "1", and drive the signal when the enable is "0".
c.       Floorplan the 3-states.
10.  Add pipeline registers.
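Item 7 in the list above (pyramid logic with parentheses) is easiest to see by counting adder levels. The sketch below is an illustration in Python, not synthesis output; the function names are hypothetical, and it assumes every addition costs one logic level.

```python
def serial_levels(n_operands):
    """Adder levels when a + b + c + ... is evaluated left to right."""
    return max(n_operands - 1, 0)


def pyramid_levels(n_operands):
    """Adder levels when the operands are paired up as a balanced tree."""
    # bit_length() of (n - 1) equals ceil(log2(n)) for n >= 1.
    return max(n_operands - 1, 0).bit_length()


a, b, c, d = 3, 5, 7, 9
serial_sum = ((a + b) + c) + d      # three adders in series
pyramid_sum = (a + b) + (c + d)     # two levels: two adders in parallel, then one
```

Both forms compute the same value, but for four operands the pyramid form puts two levels of logic between registers instead of three, and the gap widens as the operand count grows.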


I have a path that is failing my PERIOD constraint, but this path does not need to be valid every clock cycle. How can I constrain this path to avoid errors?

Solution

A path that is allowed to take multiple clock cycles to be valid in a design is called a multi-cycle path. These types of paths are typically covered by a PERIOD constraint by default, and might cause errors since a PERIOD is a one-cycle constraint.
To eliminate these errors, you can remove the path from the PERIOD constraint by putting a specific multi-cycle constraint on the path. A multi-cycle constraint is applied by using a "FROM:TO" constraint. A "FROM:TO" constraint has higher priority than a PERIOD constraint, and it removes the specified path(s) from the PERIOD to the "FROM:TO" constraint.
Example syntax:
NET "clk" TNM_NET = "clk";
TIMESPEC TS_clk = PERIOD clk 10 ns;
INST "source_inst_name" TNM = "source_group";
INST "destination_inst_name" TNM = "destination_group";
TIMESPEC TS_01 = FROM "source_group" TO "destination_group" TS_clk*3;
The TS_01 timespec constrains a specific path (or paths) that are only valid every three clock cycles. The "FROM:TO" in TS_01 is constrained to three times the TS_clk timespec (that is, three clock cycles or 30 ns).
For more information on "FROM:TO" constraints, see the Constraints Guide at:
http://toolbox.xilinx.com/docsan/xilinx5/data/docs/cgd/cgd0092_15.html
Suggestions for timing through irrelevant paths such as RESET or ".SR" pin
These paths can be Timing Ignored (TIG), either by path or by net. The following syntax is for NET and PATH TIG (comments begin with #):
#Net TIG;
NET reset_net TIG;
#Path TIG;
TIMESPEC TS_TIG = FROM source_group TO destination_group TIG;

Suggestions for I/O 3-state enable paths
You can use 3-state flip-flops in the IOB; you must have separate flip-flops for each IOB. To place the flip-flop in the IOB, you can use the "IOB=TRUE" constraint in the UCF.

Suggestions for long carry logic chains
You can use smaller counters to drive other counters, and cascade them so that the first counter is the enable for the second counter.

Suggestions for state machine optimization:
· Use one-hot encoding for the states.
· Use a synthesis state machine coding tool, if possible.
· Reduce the number of input signals and pre-decode the input signals.
· Register input and output signals.
· Pre-decode and register counter values.
· Remove data flow from the state machine and create control signals to control data flow.
· Duplicate states that many other states transition to, and reduce the number of states in the state equations.
· Use CASE statements and not IF-ELSE statements.
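The one-hot suggestion at the top of this list trades flip-flops for simpler decode logic. The sketch below illustrates that trade in Python; the state names and helper functions are hypothetical and are not taken from any particular design.

```python
STATES = ["IDLE", "LOAD", "RUN", "DONE"]   # hypothetical state names

# One-hot: one flip-flop per state, exactly one bit set at a time.
ONE_HOT = {name: 1 << index for index, name in enumerate(STATES)}


def binary_width(n_states):
    """Flip-flops needed for a binary-encoded state register."""
    return max(1, (n_states - 1).bit_length())


def in_state(state_register, name):
    """One-hot decode: testing a single bit replaces a full comparator."""
    return bool(state_register & ONE_HOT[name])


current = ONE_HOT["RUN"]
```

Four states need four flip-flops when one-hot encoded and two when binary encoded, but each one-hot state test reads a single bit, which keeps the logic levels between registers low.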

I placed a timing constraint on a path, but the constraint has errors. What can I do to make this timing constraint pass?
Solution

Possible suggestions for high fanout signals:
· Floorplan or LOC the origin and the global buffer of the high fanout signal.
· Duplicate the driver and tell the synthesis tool not to remove the duplicate logic.
· Use specific net fanout control on the specific net, if the synthesis tool allows.


Zynq FPGAs workaround 






Zynq-7000 SoC - Boot Image requirements when using larger than 16MB QSPI and RSA Authentication
Description: The BootROM uses linear mode to access the first 16MB of the QSPI flash to look for the boot image. There are limitations on where the boot image can be placed if a QSPI flash larger than 16MB and RSA Authentication are used. The following applies only when RSA Authentication is used on QSPI flashes larger than 16MB in single x2 or x4, dual-stacked x4 and dual-parallel x4 configurations. The boot image cannot be placed at 0x0 offset in the flash when using RSA Authentication in a QSPI device larger than 16MB.
Solution
There are three possible work-arounds for this requirement:
·         Erase the first 32KB of flash and program the boot image at 0x0 + 32KB offset.
The BootROM will fail booting from 0x0, will fall back, and will boot from the 0x0 + 32KB offset (see UG585 Zynq-7000-TRM for Boot Partition Search).
·         Program the boot image at 0x0 and Duplicate the Image Header at 0x0 + 16MB offset.
The BootROM will use the Image Header at 0x0 + 16MB offset and then will boot with the boot image programmed at 0x0.
·         Use only single x1 QSPI mode.
NOTE: If RSA is not used, the boot image can be placed at 0x0 even for larger than 16MB QSPI.
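The placement rules above can be checked with a few lines of arithmetic before programming the flash. This is a sketch under stated assumptions: the helper name is hypothetical, and it assumes the BootROM search advances in 32KB steps inside the 16MB linear window, which is how the fallback described above is read here; UG585 remains the authority on the search itself.

```python
KB = 1024
MB = 1024 * KB

SEARCH_STEP = 32 * KB      # assumed BootROM search granularity
LINEAR_WINDOW = 16 * MB    # region the BootROM reads in linear mode


def is_searchable_offset(offset):
    """True if a boot image header at this offset can be found by the search."""
    return 0 <= offset < LINEAR_WINDOW and offset % SEARCH_STEP == 0


first_workaround_offset = 0x0 + SEARCH_STEP      # image moved to 32KB
duplicate_header_offset = 0x0 + LINEAR_WINDOW    # header copy at 16MB
```

Note that the second work-around places the duplicated header at 16MB, which lies outside the window this check accepts; that placement is specific to the work-around and is not a normal search location.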
Boot Image requirements when using larger than 16MB QSPI with optional Execute-in-Place (XIP) mode
Description: The BootROM uses linear mode to access the first 16MB of the QSPI flash to look for the boot image. There are limitations on where the boot image can be placed if a QSPI flash larger than 16MB is used with the optional Execute-in-Place mode. The same limitation applies when RSA is used. The QSPI linear mode read command (6bh) might not get the correct data for the first 4 bytes. Looking at the following figure, the incorrect 4 bytes of data are from the upper 16MB of memory. This is a known behavior of the QSPI controller in linear mode.



In the normal QSPI boot mode (not XIP), the first 4 bytes of the BootROM header are not used, and the checksum is only calculated from 0x020 to 0x044, so the incorrect data can be ignored. QSPI will boot without issues. However, if QSPI boots with XIP, the first word is used to remap the flash linear address space. The following applies only when RSA Authentication is used on QSPI flashes larger than 16MB in single x2 or x4, dual-stacked x4, and dual-parallel x4 configurations. The boot image cannot be placed at 0x0 offset in the flash when using a QSPI device larger than 16MB with the optional Execute-in-Place mode.
Solution
There are three possible work-arounds for this requirement:
·         Erase the first 32KB of flash and program the boot image at 0x0 + 32KB offset.
The BootROM will fail booting from 0x0, then will fall back and boot from the 0x0 + 32KB offset (see UG585 Zynq-7000-TRM for Boot Partition Search).
·         Program the boot image at 0x0 and Duplicate the Image Header at 0x0 + 16MB offset.
The BootROM will use the Image Header at 0x0 + 16MB offset and will then boot with the boot image programmed at 0x0.
·         Use only single x1 QSPI mode.
NOTE: If XIP is not used, the boot image can be placed at 0x0 even for QSPI larger than 16MB.
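The description above notes that the header checksum covers only 0x020 to 0x044, which is why bad data in the first 4 bytes is harmless outside XIP. The sketch below illustrates that point. It assumes the checksum is the inverted 32-bit sum of the little-endian words in that range; this should be checked against UG585 before relying on it, and the function name and test header are hypothetical.

```python
import struct

CHECKSUM_FIRST = 0x020   # first header word covered by the checksum
CHECKSUM_LAST = 0x044    # last header word covered by the checksum


def header_checksum(image):
    """Inverted 32-bit sum of the words at 0x020..0x044 (assumed algorithm)."""
    count = (CHECKSUM_LAST - CHECKSUM_FIRST) // 4 + 1          # 10 words
    words = struct.unpack_from("<%dI" % count, image, CHECKSUM_FIRST)
    return ~sum(words) & 0xFFFFFFFF


# Hypothetical header: all zero except one word inside the checksummed range.
header = bytearray(0x48)
struct.pack_into("<I", header, 0x020, 0x00000001)
clean = header_checksum(bytes(header))

header[0:4] = b"\xde\xad\xbe\xef"      # corrupt the first 4 bytes
corrupted = header_checksum(bytes(header))
```

Because the first word lies outside the summed range, `clean` and `corrupted` are equal, matching the statement that normal QSPI boot is unaffected.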

QSPI boot time consideration with larger QSPI memory

Description: When booting a Zynq device with a large (> 16MB) QSPI, such as on the Zedboard, boot times do not improve when a fast QSPI clock (over 40 MHz) is programmed. When the user changes the FSBL default QSPI clock from 25 MHz (divide by 8) to 100 MHz (divide by 2), the boot time (to load the bitstream and application) is expected to drop. Instead, the boot time stays the same. Why does this happen?
Solution
This behavior is due to the way the QSPI driver is structured.  
When a QSPI part is greater than 16MB, the driver switches over to I/O access mode from linear.  
I/O requires that the memory be copied in a different manner which is not as fast as the direct copy in linear mode.
This can be worked around by doing the following:
If all of the boot code fits into the lower 16MB, add the following line in the FlashReadID function in qspi.c (located in the FSBL application source directory), at the end of the "Deduce flash size" code block, before "return XST_SUCCESS":
QspiFlashSize = FLASH_SIZE_128M;
Note: this will not work if the boot image is greater than 16MB or if it crosses the 16MB boundary.
After the boot process is complete, the system code will then control access to the QSPI and access to the full part can be restored.
This answer record is a documentation map providing information about booting a Zynq-7000 SoC device.
It links to documents which cover different modes and configurations for booting a Zynq-7000 device using your boot interface of choice.

BOOTING ZYNQ SOC
Solution
UG1046 UltraFast Embedded Design Methodology Guide now contains all of the information about booting a Zynq-7000 SoC device.

Boot Flows and Concepts (FSBL, image creation through BootGEN, multiboot and fallback mechanisms) are described in chapters 6 and 32 of (UG585) Zynq-7000 SoC Technical Ref. Manual and chapter 3 and appendix A of UG821 Zynq-7000 SoC Software Developers Guide.

Below are some more Xilinx Answers relevant for Boot and Configuration. 

Primary Boot Devices

Zynq-7000 SoC supports Quad-SPI, NAND, NOR and SD as primary boot interfaces.

(Xilinx Answer 50991) contains details about which memory vendors and devices families are tested and supported by Xilinx.

Boot Times for Zynq-7000 Devices 

 Zynq-7000 SoC: Boot Times using NAND / QSPI
Considerations on large QSPI devices

QSPI Programming/Booting Checklist

This is a list of required items, necessary actions, and points to be considered when debugging QSPI programming and booting on Zynq-7000 SoC.

Solution

Before opening a Service Request, collect all of the information requested below. 

1) Is the QSPI flash and configuration supported by Xilinx?

Please provide the full flash name, the configuration type (single, dual parallel, dual stacked) and the voltage. If the configuration is not "standard" (muxes, level shifters or other), provide also the board schematics.

 2) Is Zynq Production Silicon?

Use XMD to read and report the PS_VERSION from 0xF8007080.                           
Please provide Silicon Version reporting register 0xF8007080

 3) Is the JTAG chain operating properly?

Use XMD to try to connect to the CPU.
Please provide JTAG chain description (how many devices on the chain, how many Zynq, Zynq in cascade or independent JTAG, any level shifter in the chain). Report any XMD error.

 

4) In which phase of booting is Zynq failing? BootROM or FSBL?

In order to determine this, program an image with FSBL debug prints enabled (#define FSBL_DEBUG_INFO in fsbl_debug.h).
- If some printing comes out on the UART during boot:
Please provide a log of the FSBL printout on the UART. FSBL is a user application and can be easily debugged using SDK. Try to do a brief investigation before filing a Service Request.
- If nothing comes out on the UART during boot, first double check the UART baud rate.
Please provide the status of INIT_B (high or low or blinking), REBOOT_STATUS and BOOT_MODE registers after the boot failure.
Most likely the boot image was not programmed properly (continue to step 5).

 

5) Are SDK and iMPACT failing to program?

  • See (Xilinx Answer 59275) for SDK / iMPACT QSPI programming known issues and workarounds.
  • For debug purposes the Debug Environmental Variable XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES can be set to 1. See (Xilinx Answer 59272) for more details.
Please provide the version of the tool used. Be sure your image was built with the same version of the tool used to program.
Please provide the boot mode settings used for programming (booting from JTAG or QSPI).
Please provide the log obtained using the XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES variable.

 

6) Is it working using u-boot?


Use the u-boot.elf pre-built from the latest released image on the wiki, and follow the CTT guide (UG873) including the instructions under "Program QSPI Flash With the Boot Image Using JTAG and U-Boot Command".
Please provide the log of the programming using pre-built u-boot image from the wiki. Specify the u-boot version used.

 

7) Is the board designed to support the QSPI frequency used for programming?


Use u-boot and double check the clock settings to verify the QSPI clock frequency (QSPI_REF_CLK and QSPI_CLK on the CLK pin).
Remember that QSPI has two modes of operations depending on if the clock frequency is higher or lower than 40MHz. Calculate and verify the QSPI clock speed.
Please provide the register settings and the calculation done to verify the QSPI clock frequency.

 

8) Is the Xilinx standalone example working?


Some Debugging is required to understand where the example is failing (through the SDK debugger or by adding debug prints).
Is the issue with the initial query of the QSPI or a mismatch between writes and reads?
Is there any error pattern in the read back data? (Maybe a particular bit stuck to 1 or 0).
Report the type of failure in the Xilinx standalone example
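Step 7 of the checklist asks for the calculation used to verify the QSPI clock. A minimal sketch of that arithmetic follows. The 200 MHz reference clock is an assumption inferred from the figures quoted earlier in this post (25 MHz at divide-by-8, 100 MHz at divide-by-2), and the function names are hypothetical; the real values must be read from the clock registers on the board.

```python
QSPI_REF_CLK_HZ = 200_000_000     # assumed reference clock, see lead-in
MODE_THRESHOLD_HZ = 40_000_000    # the 40 MHz boundary named in step 7


def qspi_clk_hz(ref_clk_hz, divider):
    """QSPI interface clock for a given reference clock and divider."""
    return ref_clk_hz // divider


def is_above_threshold(clk_hz):
    """True when the clock falls in the faster of the two operating modes."""
    return clk_hz > MODE_THRESHOLD_HZ


fsbl_default = qspi_clk_hz(QSPI_REF_CLK_HZ, 8)   # FSBL default divider
fast_setting = qspi_clk_hz(QSPI_REF_CLK_HZ, 2)
```

Reporting the reference clock, the divider and the resulting frequency together, as this sketch does, gives a support engineer everything needed to confirm which mode the controller is operating in.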

SD Programming/Booting Checklist

Solution

Before opening a Service Request, collect all of the information requested below.

1) SD booting.

SD booting is supported by Xilinx.
See (Xilinx Answer 50991) to determine the benefits and downsides of SD booting.
Please provide the board schematics and the name of the SD Memory Card used.

 

2) Is Zynq Production Silicon?

- Use XMD to read and report the PS_VERSION from 0xF8007080.

3) In which phase of booting is Zynq failing? BootROM or FSBL?

In order to determine this, program an image with FSBL debug prints enabled.
#define FSBL_DEBUG_INFO in fsbl_debug.h
If some printing comes out on the UART during boot,
Please Provide a log of the FSBL print out on the UART.
FSBL is a user application and can be easily debugged using SDK.

This should be attempted before filing a Service Request.
- If nothing comes out on the UART during boot, first double check the UART baudrate.
     
Please provide the status of INIT_B (high or low or blinking), REBOOT_STATUS and BOOT_MODE registers after the boot failure.

 

4) Are SD_CD and SD_WP properly connected on the board?

  •  FSBL - Fails to boot from SD if the card is write protected (WP is active)
  •  EMIO SD Write Protect and Card Detect Signals Not Properly Configured In Zynq FSBL
Please provide ps7_init.tcl to verify SD_CD and SD_WP are properly configured to be directed to MIO or EMIO (depending on the board schematics).

 

5) Is the SD running at a supported frequency?

- Check the SD clock configuration. See (UG585) chapter 25 about Clocks.
Please provide the register settings and the calculation done to verify the SD clock frequency.

 

6) Is the JTAG chain operating properly?

- Use XMD to attempt to connect to the CPU.
Please provide JTAG chain description (how many devices on the chain, how many Zynq, Zynq in cascade or independent JTAG, any level shifter in the chain).
Report any XMD error.

 

7) Is the Xilinx standalone example working?

Some Debugging is required to understand where the example is failing (Through the SDK debugger or by adding debug prints).
Report the type of failure found in the Xilinx standalone example.
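The checklists ask for the PS_VERSION value read from 0xF8007080. Once the 32-bit register value has been read with XMD, the version field can be separated from the other bits as sketched below. The bit position used here (the top four bits of the register) is an assumption that should be confirmed against the register description in UG585; the function name and the sample value are hypothetical.

```python
PS_VERSION_SHIFT = 28     # assumed field position: bits [31:28]
PS_VERSION_MASK = 0xF


def ps_version_field(register_value):
    """Extract the PS_VERSION field from the value read at 0xF8007080."""
    return (register_value >> PS_VERSION_SHIFT) & PS_VERSION_MASK


sample_read = 0x30800100              # hypothetical value, not from real hardware
version_code = ps_version_field(sample_read)
```

The extracted code is then compared with the silicon version table in the TRM; reporting both the raw register value and the decoded field avoids ambiguity in a Service Request.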

NAND Programming/Booting Checklist

This is a list of required items, necessary actions, and points to be considered while debugging NAND programming and booting on Zynq-7000 SoC.
Solution
Before opening a Service Request, please collect all of the information below highlighted in Bold.




1.      Is the NAND flash and configuration supported by Xilinx?
To determine the support category (Supported, Limited Support or Unsupported)
Note: Only On-Die ECC (Micron) and 1-bit ECC (Spansion) NAND devices can be used with Zynq-7000 SoC. The Zynq NAND flash controller provides only 1-bit ECC and a single chip select, which means that a NAND device requiring multi-bit ECC or multiple chip selects cannot be used. Micron On-Die ECC NAND devices, which apply multi-bit ECC on chip instead of using the ECC from the controller, can provide better ECC support.
Provide the full NAND flash name, the configuration mode and the voltage. If the configuration is not "standard" (muxes, level shifters or other), provide also the board schematics.


2. Are the signals properly connected to the memory device?

Provide a schematic of NAND part and monitor the data signals on the board to verify what kind of words are read.

 


3. Is Zynq Production Silicon?

- Use XMD to read and report the PS_VERSION from 0xF8007080.
  for Silicon Version Register Values               
-If it is not a production silicon, check 
  for the silicon revision differences.

Provide Silicon Version reporting register 0xF8007080
 


4. Is the JTAG chain operating properly?

Use XMD to try to connect to the CPU.

Provide JTAG chain description (how many devices on the chain, how many Zynq, Zynq in cascade or independent JTAG, any level shifter in the chain). Report any XMD error.
 


5. In which phase of booting is Zynq failing? BootROM or FSBL?

There are some issues related to FSBL, first check the following answers

  •  for the booting a bin monolithic Linux image
  •  for booting in secure boot mode
In order to understand this, program an image where FSBL has debug prints enabled. Set the FSBL_DEBUG_INFO FSBL compilation flags.


If some printing comes out on the UART during boot:

Provide a log of the FSBL print out on the UART. FSBL is a user application and can be easily debugged using SDK. Try to do a brief investigation before filing a Service Request.


  • If nothing comes out on the UART during boot, first double check the UART baudrate.
  • Check that the boot image is placed in the first 128MB of NAND; the BootROM only searches a limited address range.

 Provide the status of INIT_B (high or low or blinking), REBOOT_STATUS and BOOT_MODE registers after the boot failure. Most likely the boot image was not programmed properly (continue to step 6).
 


6. Are SDK and iMPACT failing to program?

If there are multiple devices on the JTAG chain, and Zynq is not the first device.

For debug purposes the Debug Environmental Variable XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES can be set on your Windows or Linux machine.

Example of setting a Debug Environmental Variable for Windows:





Example of setting a Debug Environmental Variable for Linux:


setenv XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES 1

NAND programming from Flash Writer requires DDR. Flash Writer needs a FSBL to initialize the board including DDR.

Be sure the FSBL is the same used in the Boot image


Provide the version of the tool used. Be sure your image was built with the same version of the tool used to program.
Provide the boot mode settings used for programming (booting from JTAG or NAND).
Provide the log obtained using the XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES variable.
 


7. Is it working using u-boot?

Use the u-boot.elf pre-built from the latest released image on the wiki, and follow the CTT guide (UG873) that includes the instructions under "Program QSPI Flash With the Boot Image Using JTAG and U-Boot Command".

The flow above for NAND is similar but u-boot command is different to QSPI.

An example command sequence is shown below:

  • nand info
  • nand erase 0 0x100000
  • nand write 0x800 0 0xE1000
Note: The zc702 board does not come with NAND flash so we have not added NAND flash support on zc70x configuration.

If you need this support on existing zc70x configuration, please make the below changes in include/config/zynq_zc70x.h

#define CONFIG_NAND_ZYNQ
And then compile the zc70x as
$ make zynq_zc70x_config
$ make

Provide the log of the programming using pre-built u-boot image from the wiki. Specify the u-boot version used.
 


8. Is the board designed to support the NAND frequency used for programming?

Use u-boot and double check the clock settings to verify the NAND clock frequency.

The NAND controller is based on ARM SMC PL353, refer to ARM PrimeCell Static Memory Controller (PL350 series) Technical Reference Manual r2p1 for more timing details.

Check if NAND timing parameters have been set correctly to SMC Timing Calculation.

This setting will reflect to the SMC set_cycles register.

A NAND boot might not work due to an incorrect setting here.

Refer to the NAND device AC Characteristics to input the relevant timing parameters to the CS0 column, the unit is nanoseconds.

The cycles will be calculated based on the operating frequency.


Provide the register settings and the calculation done to verify the NAND clock frequency.


 


9. Is the Xilinx standalone example working?

NAND examples are provided under the SDK install directory

sw\XilinxProcessorIPLib\drivers\nandps_v1_0X_a\examples

Some Debug is needed to understand where the example is failing (through the SDK debugger or by adding debug prints).

Report the kind of failure found in the Xilinx standalone example.
 


10. NAND can be accessed correctly when booting from a different source (for example SD), and data is correct after read/write operations (for example from u-boot), but the system won't boot from NAND.



Make sure there is a pull-up on the NAND BUSY signal.

The internal pull-up in Zynq is disabled in the BootROM (refer to TRM table 6-11).
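Step 8 of the checklist notes that the SMC cycle counts are calculated from the NAND timing parameters (in nanoseconds) and the operating frequency. The conversion can be sketched as below. The rounding-up rule is an assumption (a timing parameter is a minimum, so a fractional cycle is rounded up), and the function name and sample figures are hypothetical rather than taken from any datasheet.

```python
def ns_to_cycles(time_ns, clock_mhz):
    """Smallest whole number of clock cycles lasting at least time_ns."""
    # time_ns * clock_mhz / 1000 is the cycle count; the negated floor
    # division rounds it up without going through floating point.
    return -(-time_ns * clock_mhz // 1000)


# Hypothetical example: a 25 ns parameter with a 100 MHz memory clock.
cycles_needed = ns_to_cycles(25, 100)   # 2.5 cycles, rounded up
exact_fit = ns_to_cycles(10, 100)       # exactly one 10 ns cycle
```

Rounding down instead of up would violate the device minimum, which is one way an incorrect set_cycles value can break NAND boot.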

 QSPI flash programming now requires that you specify an FSBL

Description
Starting with the 2017.3 release, Vivado Hardware Manager and XSDK require that you specify an FSBL in order to program a QSPI flash.
This has been done in order to have a common flow between Zynq-7000 and Zynq UltraScale+.
Solution
With this change there are a number of implications:
1) The user needs a working FSBL.
If this FSBL is initializing DDR, then DDR needs to be functioning even if the QSPI flash programming does not really use it.
A suggestion is to use #define FSBL_DEBUG_INFO in the FSBL and watch the UART output to check that the FSBL executes fully, without hangs, during QSPI flash programming.

2) If you have issues programming the FLASH in Vivado 2017.3 or 2017.4, add the following environment variable.
(The ENV variable is not required for 2018.1):
XIL_CSE_ZYNQ_UBOOT_QSPI_FREQ_HZ = 10000000
This will force the mini-uBoot to set the QSPI device clock to 10 MHz. 
Note: depending on your FSBL design, you might see a different QSPI clock on your hardware.

3) The device clocking is now configured by the FSBL rather than the tool.
The configuration which was previously used by the tool is listed below. 
If you have issues programming the flash, you should check the FSBL configuration against this table.
Register Name    Register Address    Register Value    Notes
ARM_PLL_CFG      0xF8000110          0x00177EA0        (default values)
ARM_PLL_CTRL     0xF8000100          0x0001A000        ARM_PLL = 866 MHz (not bypassed)
ARM_CLK_CTRL     0xF8000120          0x1F000400        CPU_6x4x = 866 / 4 = 216 MHz
IO_PLL_CFG       0xF8000118          0x00177EA0        (default values)
IO_PLL_CTRL      0xF8000108          0x0001A000        IO_PLL = 866 MHz (not bypassed)
PLL_STATUS       0xF800010C          0x0000003F        ARM_PLL and IO_PLL are LOCKED and STABLE.
* Assuming PS_REF_CLK = 33.33 MHz

4) In case of XIP (Execute in place from QSPI), a custom FSBL that executes from OCM needs to be created to be specified during QSPI flash programming.
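The frequencies quoted in item 3 can be reproduced from the register values, which is useful when comparing an FSBL configuration against them. The sketch below assumes the feedback divider occupies bits [18:12] of the PLL control register and the clock divisor occupies bits [13:8] of ARM_CLK_CTRL; those field positions are assumptions to be checked against UG585, and the function names are hypothetical.

```python
PS_REF_CLK_MHZ = 33.33        # reference clock assumed in the text


def pll_fdiv(pll_ctrl):
    """Feedback divider field of a PLL control register (assumed bits [18:12])."""
    return (pll_ctrl >> 12) & 0x7F


def clk_divisor(clk_ctrl):
    """Divisor field of ARM_CLK_CTRL (assumed bits [13:8])."""
    return (clk_ctrl >> 8) & 0x3F


arm_pll_mhz = pll_fdiv(0x0001A000) * PS_REF_CLK_MHZ        # ARM_PLL_CTRL value
cpu_6x4x_mhz = arm_pll_mhz / clk_divisor(0x1F000400)       # ARM_CLK_CTRL value
```

With these values the feedback divider is 26 and the divisor is 4, giving roughly 866 MHz for the PLL and roughly 216 MHz for CPU_6x4x, in line with the figures listed above.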

 SDK is not able to program the NOR flash in NOR boot mode on production silicon

Description
This issue arises when programming an image on the NOR flash using SDK when booting in NOR mode on either of the following:


  • A ZC702 with 7020 production silicon.
  • A ZC706 with 7045 production silicon.
Solution
The reason is that when booting in NOR mode, the BootROM might not find a valid image.

It then enters non-secure lockdown and puts all of the MIO in 3-state.

As a result flash_writer cannot configure the NOR properly.

The work-around is to force the NOR MIO to the correct settings.

Because flash_writer requires an FSBL to configure NOR, you will need to add XSmc_NorInit() to nor.c in the FSBL:
 
void InitNor(void)
{
        /*
         * Set up the base address for access
         */
        FlashReadBaseAddress = XPS_NOR_BASEADDR;
        XSmc_NorInit();   /* <--Add this line */
}
Note: This is not a silicon requirement, it is a tool limitation.

NAND programming requires the board to be in JTAG mode

NAND programming requires the board to be in JTAG mode.
Solution
To work around this issue, set the boot mode to JTAG boot.

If booting in JTAG mode is not possible (for example if a custom board does not have boot mode jumpers), the FSBL used for NAND programming can be modified.

Instead of reading the bootmode from the MIO:
BootModeRegister = Xil_In32(BOOT_MODE_REG);
BootModeRegister &= BOOT_MODES_MASK;
The code can "fake" the boot mode hard-coding JTAG:
BootModeRegister = JTAG_MODE;
This can be modified in the FSBL main.c file, for the FSBL used to program the NAND only (NOT the FSBL that goes into the boot image).

Note: This is not a Silicon issue, but a tool limitation.

Failed to program QSPI flashes with 256K and 512K erase sector size

According to the datasheet of S25FL512S, the erase sector is 256K.

So from the below error message, 2228224(0x220000) bytes are going to be erased.

But 2228224 is not a multiple of 256k.

This is causing the following error.

zynq-uboot> sf erase 0 220000
SF: 2228224 bytes @ 0x0 Erased: ERROR
Note: If the flashes are in dual parallel configuration, the erase sector size becomes 512K.
Solution
The work-around is to use u-boot and issue the "sf erase" command with the proper size.
An alternative is to use iMPACT (not Vivado) and set this environmental variable to change the erase sector size:
XIL_CSE_ZYNQ_FLASH_SECTOR_SIZE set to 262144
Note: This issue is scheduled to be fixed in the 2015.3 release. 
From this release on, Vivado will check for the proper erase sector size.
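The "proper size" for the sf erase command is the image length rounded up to a whole number of erase sectors. A small sketch of that calculation follows, using the figures from the error message above; the function name is hypothetical.

```python
SECTOR_256K = 256 * 1024      # S25FL512S erase sector, per the text
SECTOR_512K = 512 * 1024      # dual parallel configuration, per the text


def round_up_to_sector(length, sector_size):
    """Smallest multiple of sector_size that is >= length."""
    return -(-length // sector_size) * sector_size


requested = 0x220000                                  # 2228224 bytes, as in the log
is_aligned = requested % SECTOR_256K == 0
single_flash_erase = round_up_to_sector(requested, SECTOR_256K)
dual_parallel_erase = round_up_to_sector(requested, SECTOR_512K)
```

For the failing example the erase length becomes 0x240000 on a single flash and 0x280000 in dual parallel configuration, and either value can be passed to sf erase in place of 220000.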

2013.3 / 2013.4 /14.7 SDK, iMPACT, Zynq-7000 - Indirect Flash program does not work with multiple Zynq devices on JTAG chain

Description
There are multiple Zynq devices on the JTAG chain. 
Indirect Flash program from iMPACT and SDK works only for the first Zynq device on the JTAG chain.
How do I program Flash on the other Zynq devices?
Solution
If all Zynq devices are connected to each other in cascaded JTAG mode, change it to independent JTAG mode for the Zynq devices other than the target device.
For example, if there are three Zynq devices on the JTAG chain and you are programming Flash via the third device:
[ARM DAP#1] -> [PL TAP#1] -> [ARM DAP#2] -> [PL TAP#2] -> [ARM DAP#3] -> [PL TAP#3]
Change the first and second device to the independent JTAG mode:
[PL TAP#1] -> [PL TAP#2] -> [ARM DAP#3] -> [PL TAP#3]
This issue will be fixed in the 2014.1 version of the tools.
To work around this issue until then, please follow the steps below:

Note: This solution assumes that the 2013.4 tools are used and the patch in (Xilinx Answer 59896) has been applied.

This solution uses the U-Boot API to copy the image to the QSPI.
 
The U-Boot ELF can be obtained from the Xilinx Wiki.
 
Launch SDK, and do the following:
 
  • Connect to 2nd device in the JTAG chain
  • Init the PS7
  • Place the BOOT.bin into DDR (you can use any address, but don't overwrite the uboot at 0x04000000)
  • Download the uboot ELF
  • Execute
 Here are the commands to be used as a reference:
connect arm hw -debugdevice devicenr 3
source ps7_init.tcl
ps7_init
dow -data BOOT.bin 0x08000000
Open the serial port and stop the U-Boot autoboot sequence by pressing Enter.

Next, the QSPI API will be used to copy the (BIN file) image from DDR on to the QSPI.

The commands to do this are seen below:
sf probe 0 0 0
sf erase 0 0x200000
sf write 0x08000000 0 0x200000

Finally, set the Mode pins to boot from QSPI, and do a POR.

The boot image in the QSPI will now be used.


Pros: Fast turnaround time, re-programming capability, dynamic reconfiguration capability

Cons: performance and integration are not as good as full-custom or standard-cell-based ICs; higher power consumption

NB: the integration issue is mitigated by SoC technology (microprocessors + FPGA in the same device)

=======================================================================

Strategies & directives

  • Add phys_opt_design to the implementation flow. This will do timing based physical optimization which can help with congestion. Multiple iterations of phys_opt_design can also help, with each using different options.
    Also, there is the option to use phys_opt_design post-placement or post-routing. See phys_opt_design -help for more information

  • Vivado has several congestion-specific Strategies that can be used (Tools -> Options -> Strategies). From these Strategies, specific directives for sub-steps such as place_design & route_design can be found that can be useful for congestion.
    Also, the -Explore directive will generally give better results at a cost of increased run-time.
  • Try using the Vivado Synthesis AlternateRoutability directive.

  • Try iterating through different place_design directives (found with place_design -help). Specifically, the SpreadLogic_high/medium/low and AltSpreadLogic_high/medium/low directives are meant to spread logic to prevent congestion.
    Comparing the results with different place_design directives and running through route_design will give you an idea of which directives perform better for a specific design.
    Please note that the best performing directive can change as the design changes. "report_timing_summary" and "report_design_analysis -congestion" can be used to compare the different directives.

  • Finding a suitable opt_design directive is also helpful for a congested design. Run report_utilization after each iteration to see which yields the lowest LUT or FF count, depending on which element is highly utilized.

  • Try using the Vivado Synthesis option -resource_sharing with the value on. This can often share arithmetic operators; the default is auto.

  • Try taking certain timing critical paths and over constraining them only during place_design and phys_opt_design. This prioritizes these paths which can lead to better QOR.

  •  Try floorplanning block RAMs or DSPs. It can be helpful to floorplan block RAMs or DSPs using the best timing results from iterating through different place_design directives, and fixing these LOC constraints for further implementation runs.
    There are a few options to obtain these constraints:
  1. From a routed design, select the Block RAMs or DSPs that you wish to fix, and right-click within the Device Window and select "Fix Cells". Saving the design at this point will save the fixed constraints.
  2. To print out LOC constraints for the current block RAM or DSP placement, use the below commands.
    The results can be copied into an XDC file.
    Similar syntax can be used for DSPs, and the search pattern can be found from Vivado IDE Find window.
     
set BRAMS [get_cells -hierarchical -filter { PRIMITIVE_TYPE =~ BLOCKRAM.BRAM.* } ]
foreach i $BRAMS {puts "set_property LOC [get_property LOC [get_cells $i]] \[get_cells \{${i}\}\]"}
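
The directive iteration described in the list above can be sketched as a loop in Non-Project Mode. This is only an illustration: the checkpoint path and report file names are hypothetical, and the directive list should be taken from place_design -help for the Vivado version in use.

```tcl
# Assumes a checkpoint written after opt_design; the path is hypothetical.
set checkpoint ./output/post_opt.dcp
set directives {Default SpreadLogic_high SpreadLogic_medium AltSpreadLogic_high}

foreach d $directives {
    # Start every trial from the same post-opt database
    open_checkpoint $checkpoint
    place_design -directive $d

    # Reports used to compare the directives against each other
    report_timing_summary -file ./output/place_${d}_timing.rpt
    report_design_analysis -congestion -file ./output/place_${d}_congestion.rpt

    # Keep the placed result so the best one can be routed later
    write_checkpoint -force ./output/place_${d}.dcp
    close_design
}
```

Comparing the timing and congestion reports side by side shows which directive suits the current design; the comparison should be repeated when the design changes.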

Reducing & Controlling fanout
  • Use the -force_replication_on_nets option of phys_opt_design. This is a good option for reducing fanout as a post-place phys_opt_design will be able to use the placement information to decide which nets are driven from replicated drivers so that the path length is not excessive.
    Example command: "phys_opt_design -force_replication_on_nets ${hi_fanout_nets}" where hi_fanout_nets is a list of nets you would like to reduce the fanout of.
    See phys_opt_design -help for more information. A Tcl example is attached to this Answer Record that finds synchronously driven high fanout nets and creates a variable with a list of nets that can be used in the phys_opt_design command.
  • Use the -fanout_opt place_design option. Available with 2017.x, the -fanout_opt performs high fanout replication of critical nets. See place_design -help for more information on the option.

  • Use global buffers on non-clock high-fanout nets. The opt_design command can automatically insert BUFGs on high fanout nets.
    Using global clocking resources can help congestion due to high fanout nets. Consult the report_high_fanout_nets report from the routed design to see if there are potential candidates. Also, automatic BUFG insertion by opt_design can be adjusted. See (Xilinx Answer 54177) for more information.
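
A minimal sketch of the fanout flow above, run on a placed design. The net names are hypothetical placeholders; in practice they would be picked from the report, or produced by the Tcl example attached to the Answer Record.

```tcl
# List the worst offenders first (report file name is hypothetical).
report_high_fanout_nets -max_nets 20 -file ./output/high_fanout.rpt

# Hypothetical net names chosen after reading the report.
set hi_fanout_nets [get_nets {ctrl/enable_r ctrl/soft_reset_r}]

# Replicate the drivers of those nets using the placement information.
if {[llength $hi_fanout_nets] > 0} {
    phys_opt_design -force_replication_on_nets $hi_fanout_nets
}
```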
Reducing Local Congestion
  • Try reducing/removing LUT combining from synthesis (-no_lc). This can reduce the number of nets entering CLBs that become congested due to LUT inputs.

  • Try enlarging or removing pblock constraints if the logic constrained is related to the congested region. This gives the placer more flexibility to avoid the congestion.
Reducing utilization
  • Analyze the necessity of resets. Some reset signals might not be necessary if the GSR can provide the necessary initialization. Consult Chapter 4 (RTL Coding Guidelines) of (UG949) for more information
======================================================================
                                    Vivado  Non project mode

# Step 1: Read in top-level EDIF netlist from synthesis tool
read_edif c:/top.edf
# Read in lower level IP core netlists
read_edif c:/core1.edf
read_edif c:/core2.edf

# Step 2: Specify target device and link the netlists
# Merge lower level cores with top level into single design
link_design -part xc7k325tfbg900-1 -top top

# Step 3: Read XDC constraints to specify timing requirements
read_xdc c:/top_timing.xdc
# Read XDC constraints that specify physical constraints such as pin locations
read_xdc c:/top_physical.xdc

# Step 4: Optimize the design with default settings
opt_design

# Step 5: Place the design using the default directive and save a checkpoint
# It is recommended to save progress at certain intermediate steps
# The placed checkpoint can also be routed in multiple runs using different options
place_design -directive Default
write_checkpoint post_place.dcp

# Step 6: Route the design with the AdvancedSkewModeling directive. For more information
# on router directives type 'route_design -help' in the Vivado Tcl Console
route_design -directive AdvancedSkewModeling

# Step 7: Run Timing Summary Report to see timing results
report_timing_summary -file post_route_timing.rpt
# Run Utilization Report for device resource utilization
report_utilization -file post_route_utilization.rpt

# Step 8: Write checkpoint to capture the design database;
# The checkpoint can be used for design analysis in Vivado IDE or TCL API
write_checkpoint post_route.dcp

The key steps in the Non-Project Mode  are:

• Step 1: Read Design Source Files
• Step 2: Build the In-Memory Design
• Step 3: Read Design Constraints
• Step 4: Perform Logic Optimization
• Step 5: Place the Design
• Step 6: Route the Design
• Step 7: Run Required Reports
• Step 8: Save the Design Checkpoint
Strategy selects the strategy to use for the implementation run.


Defining Implementation Strategies

A run strategy is a defined approach for resolving the synthesis or implementation challenges of the design.

 • Strategies are defined in pre-configured sets of options for the Vivado implementation features. 
• Strategies are tool and version specific.

• Each major release of the Vivado Design Suite includes version-specific strategies.

Vivado implementation includes several commonly used strategies that are tested against internal benchmarks.

 Vivado Design Suite includes a set of pre-defined implementation strategies, or you can create your own. 


Options

When you select a strategy, each step of the Vivado implementation process displays in a table in the lower part of the dialog box:
 • Opt Design (opt_design)
• Power Opt Design (power_opt_design) (optional)
• Place Design (place_design)
• Post-Place Power Opt Design (power_opt_design) (optional)
• Post-Place Phys Opt Design (phys_opt_design) (optional)
• Route Design (route_design)
• Post-Route Phys Opt Design (phys_opt_design) (optional)
• Write Bitstream (write_bitstream)
 Click the command option to view a brief description of the option at the bottom of the Design Run Settings dialog box.
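
The same settings can also be applied from Tcl in Project Mode. The sketch below assumes an implementation run named impl_1 already exists; the strategy and directive names are examples only, since strategies are version specific and should be checked against the installed release.

```tcl
# Select a predefined strategy for the run (name is version specific).
set_property strategy Congestion_SpreadLogic_high [get_runs impl_1]

# Enable an optional step and override one directive on top of the strategy.
set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
set_property STEPS.PLACE_DESIGN.ARGS.DIRECTIVE AltSpreadLogic_high [get_runs impl_1]

# Run implementation up to and including routing.
launch_runs impl_1 -to_step route_design
wait_on_run impl_1
```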



===============================================================

Use a makefile and Tcl to run simulation, synthesis, implementation and bitstream generation

 

# Makefile

sim:

    vivado -mode batch -source sim.tcl

imp:

    vivado -mode batch -source non-proj.tcl

proj:

    vivado -mode batch -source proj.tcl

 

------------------------------------------------------------------------------------

 

# sim.tcl

set path [file dirname [info script]]

puts "script is invoked from $path"

source [file join $path modelsim.tcl]

 

------------------------------------------------------------------------------------

 

 

# modelsim.tcl

set file_dir [file normalize [file dirname [info script]]]

 

puts "== Unit Test directory: $file_dir"

#set ::env(XILINX_TCLAPP_REPO) [file normalize [file join $file_dir .. .. ..    ]]

 

#puts "== Application directory: $::env(XILINX_TCLAPP_REPO)"

#lappend auto_path $::env(XILINX_TCLAPP_REPO)

 

set name "modelsim"

 

create_project $name ./$name -force

add_files -fileset sources_1 "$file_dir/q1.vhd"

#add_files -fileset sources_1 "$file_dir/.vhd"

add_files -fileset sim_1 "$file_dir/tb.vhd"

update_compile_order -fileset sources_1

update_compile_order -fileset sim_1

launch_simulation -batch

close_project

 

 

------------------------------------------------------------------------------------

 

 

# non-proj.tcl

# STEP#1: setup design sources and constraints

read_vhdl ./q1.vhd

#read_vhdl ./src/AND_TEST.vhd

#read_vhdl ./src/OR_GATE.vhd

# read_vhdl

# read_verilog

read_xdc ./q1.xdc

#

# STEP#2: define the output directory area.

set outputDir ./output

file mkdir $outputDir

 

#

# STEP#3: run synthesis, write design checkpoint, report timing,

# and utilization estimates

#

synth_design -top q1 -part xc7z020clg484-1

write_checkpoint -force $outputDir/post_synth.dcp

report_timing_summary -file $outputDir/post_synth_timing_summary.rpt

report_utilization -file $outputDir/post_synth_util.rpt

#

# Run custom script to report critical timing paths

#reportCriticalPaths $outputDir/post_synth_critpath_report.csv

#

# STEP#4: run logic optimization, placement and physical logic optimization,

# write design checkpoint, report utilization and timing estimates

#

opt_design

#reportCriticalPaths $outputDir/post_opt_critpath_report.csv

place_design

report_clock_utilization -file $outputDir/clock_util.rpt

#

# Optionally run optimization if there are timing violations after placement

if {[get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]] < 0} {

    puts "Found setup timing violations => running physical optimization"

    phys_opt_design

}

write_checkpoint -force $outputDir/post_place.dcp

report_utilization -file $outputDir/post_place_util.rpt

report_timing_summary -file $outputDir/post_place_timing_summary.rpt

 

 

#

# STEP#5: run the router, write the post-route design checkpoint, report the routing

# status, report timing, power, and DRC, and finally save the Verilog netlist.

#

route_design

write_checkpoint -force $outputDir/post_route.dcp

report_route_status -file $outputDir/post_route_status.rpt

report_timing_summary -file $outputDir/post_route_timing_summary.rpt

report_power -file $outputDir/post_route_power.rpt

report_drc -file $outputDir/post_imp_drc.rpt

write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true

#

# STEP#6: generate a bitstream

#

write_bitstream -force $outputDir/proj.bit

 

 

 

------------------------------------------------------------------------------------

 

 

 

# proj.tcl

#

# STEP#1: define the output directory area.

#

set outputDir ./project

file mkdir $outputDir

create_project project_cpu_project ./project -part xc7z020clg484-1 -force

 

#

# STEP#2: setup design sources and constraints

#

# example:

#add_files -fileset sim_1 ./Sources/hdl/cpu_tb.v

#add_files [ glob ./Sources/hdl/bftLib/*.vhdl ]

#add_files ./Sources/hdl/bft.vhdl

#add_files [ glob ./Sources/hdl/*.v ]

#add_files [ glob ./Sources/hdl/mgt/*.v ]

#add_files [ glob ./Sources/hdl/or1200/*.v ]

#add_files [ glob ./Sources/hdl/usbf/*.v ]

#add_files [ glob ./Sources/hdl/wb_conmax/*.v ]

#add_files -fileset constrs_1 ./Sources/top_full.xdc

#set_property library bftLib [ get_files [ glob ./Sources/hdl/bftLib/*.vhdl]]

add_files -fileset sim_1 ./tb.vhd

add_files ./q1.vhd

add_files -fileset constrs_1 ./q1.xdc

 

#

# Physically import the files under project_cpu_project.srcs/sources_1/imports directory

import_files -force -norecurse

#

#

# Physically import q1.xdc under project_cpu_project.srcs/constrs_1/imports directory

import_files -fileset constrs_1 -force -norecurse ./q1.xdc

# Set the top module (q1, the same top used by synth_design in non-proj.tcl),

# then update the compile order for 'sources_1' and 'sim_1'

set_property top q1 [current_fileset]

 

update_compile_order -fileset sources_1

update_compile_order -fileset sim_1

#

# STEP#3: run synthesis and the default utilization report.

#

launch_runs synth_1

wait_on_run synth_1

#

# STEP#4: run logic optimization, placement, physical logic optimization, route and

#         bitstream generation. Generates design checkpoints, utilization and timing

#         reports, plus custom reports.

set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]

set_property STEPS.OPT_DESIGN.TCL.PRE [pwd]/pre_opt_design.tcl [get_runs impl_1]

set_property STEPS.OPT_DESIGN.TCL.POST [pwd]/post_opt_design.tcl [get_runs impl_1]

set_property STEPS.PLACE_DESIGN.TCL.POST [pwd]/post_place_design.tcl [get_runs impl_1]

set_property STEPS.PHYS_OPT_DESIGN.TCL.POST [pwd]/post_phys_opt_design.tcl [get_runs impl_1]

set_property STEPS.ROUTE_DESIGN.TCL.POST [pwd]/post_route_design.tcl [get_runs impl_1]

launch_runs impl_1 -to_step write_bitstream

wait_on_run impl_1

puts "Implementation done!"

 

 

------------------------------------------------------------------------------------

 

 

 

# pre_opt_design.tcl

############## pre_opt_design.tcl ##################

set outputDir [file dirname [info script]]/project

source [file dirname [info script]]/reportCriticalPaths.tcl

#

report_timing_summary -file $outputDir/post_synth_timing_summary.rpt

report_utilization -file $outputDir/post_synth_util.rpt

reportCriticalPaths $outputDir/post_synth_critpath_report.csv

 

 

------------------------------------------------------------------------------------

 

# post_opt_design.tcl

############## post_opt_design.tcl ##################

# Run custom script to report critical timing paths

reportCriticalPaths $outputDir/post_opt_critpath_report.csv

 

 

------------------------------------------------------------------------------------

 

 

# post_place_design.tcl

############## post_place_design.tcl ##################

report_clock_utilization -file $outputDir/clock_util.rpt

 

 

 

------------------------------------------------------------------------------------

 

 

# post_phys_opt_design.tcl

############## post_phys_opt_design.tcl ##################

report_utilization -file $outputDir/post_phys_opt_util.rpt

report_timing_summary -file $outputDir/post_phys_opt_timing_summary.rpt

 

 

 

------------------------------------------------------------------------------------

 

 

 

# post_route_design.tcl

############## post_route_design.tcl ##################

report_route_status -file $outputDir/post_route_status.rpt

report_timing_summary -file $outputDir/post_route_timing_summary.rpt

report_power -file $outputDir/post_route_power.rpt

report_drc -file $outputDir/post_imp_drc.rpt

write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true

 

 

------------------------------------------------------------------------------------

 

 

# reportCriticalPaths.tcl

 

#------------------------------------------------------------------------

# This function generates a CSV file that provides a summary of the first

# 50 violations for both Setup and Hold analysis. So a maximum number of

# 100 paths are reported.

#

# #------------------------------------------------------------------------

proc reportCriticalPaths { fileName } {

# Open the specified output file in write mode

    set FH [open $fileName w]

    # Write the current date and CSV format to a file header

    puts $FH "#\n# File created on [clock format [clock seconds]]\n#\n"

    puts $FH "Startpoint,Endpoint,DelayType,Slack,#Levels,#LUTs"

    # Iterate through both Min and Max delay types

    foreach delayType {max min} {

        # Collect details from the 50 worst timing paths for the current analysis

        # (max = setup/recovery, min = hold/removal)

        # The $path variable contains a Timing Path object.

        foreach path [get_timing_paths -delay_type $delayType -max_paths 50 -nworst 1] {

        # Get the LUT cells of the timing paths

        set luts [get_cells -filter {REF_NAME =~ LUT*} -of_object $path]

        # Get the startpoint of the Timing Path object

        set startpoint [get_property STARTPOINT_PIN $path]

        # Get the endpoint of the Timing Path object

        set endpoint [get_property ENDPOINT_PIN $path]

        # Get the slack on the Timing Path object

        set slack [get_property SLACK $path]

        # Get the number of logic levels between startpoint and endpoint

        set levels [get_property LOGIC_LEVELS $path]

        # Save the collected path details to the CSV file

        puts $FH "$startpoint,$endpoint,$delayType,$slack,$levels,[llength $luts]"

        }

    }

    # Close the output file

    close $FH

    puts "CSV file $fileName has been created.\n"

    return 0

}; # End PROC

 

 

 =======================

 

Start with no timing constraints

Use IDE to view the clock networks

Create basic clocks

Review Timing reports for coverage

Open the synthesized design. See the schematic

Report clock networks. Click Constraints

See the unconstrained clocks
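
A minimal sketch of the baseline steps above in Tcl, on an open synthesized design. The port name clk_in, the clock name sysclk and the 10 ns period are hypothetical values for illustration.

```tcl
# Show the clock networks and which of them are still unconstrained.
report_clock_networks

# Create a basic primary clock on the clock input port (hypothetical values).
create_clock -name sysclk -period 10.000 [get_ports clk_in]

# List the clocks now known to the timer and check constraint coverage.
report_clocks
check_timing
```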

-------------------------------------------------------------------------------------------------------------

 

Before editing, run report_clocks

Edit timing constraints


 

Click generated clock



Description:

Create a generated clock object

 

Syntax:

create_generated_clock  [-name <arg>] [-source <args>] [-edges <args>]

                        [-divide_by <arg>] [-multiply_by <arg>]

                        [-combinational] [-duty_cycle <arg>] [-invert]

                        [-edge_shift <args>] [-add] [-master_clock <arg>]

                        [-quiet] [-verbose] <objects>

 

Returns:

new clock object

 

Usage:

  Name              Description

  -----------------------------

  [-name]           Generated clock name

  [-source]         Master clock source object pin/port

  [-edges]          Edge Specification

  [-divide_by]      Period division factor: Value >= 1

                    Default: 1

  [-multiply_by]    Period multiplication factor: Value >= 1

                    Default: 1

  [-combinational]  Create a divide_by 1 clock through combinational logic

  [-duty_cycle]     Duty cycle for period multiplication: Range: 0.0 to 100.0

                    Default: 0.0

  [-invert]         Invert the signal

  [-edge_shift]     Edge shift specification

  [-add]            Add to the existing clock in source_objects

  [-master_clock]   Use this clock if multiple clocks present at master pin

  [-quiet]          Ignore command errors

  [-verbose]        Suspend message limits during command execution

  <objects>         List of clock source ports, pins, or nets

 

Categories:

SDC, XDC

report_timing

if the endpoints are ports, use the generic form

if they are pins, use get_pins in Tcl

and run report_timing -from [get_pins ...] -to [get_pins ...]

view the contents of the report

Timing reports always start from the primary clock, propagate to the generated clocks, and then continue on to the clocked elements.

observe the destination clock path timing.

See the requirements of the generated clock

See the destination clock path: it starts at the next clock edge and continues on to the destination register

Slack is the required time minus the arrival time

Open the schematic of the netlist, select clk_gen and zoom in on it

Observe in the schematic the logic that is periodically enabled to generate the destination clock.

In this case, the generated clock does not have a predefined relationship with the primary clock on the clk pin

As a result, create the clock with the Tcl command

create_generated_clock

save the constraints

enter the tcl command report_clocks

observe that the new generated clock is included in the timing reports

click Report Timing Summary

select the new generated clock

see the intra-clock paths and the setup results. Double-click any path to view the path properties

see the source clock path

the source clock delay starts at the primary clock and propagates to the generated clocks, both automatically derived and manually created

see the destination clock path

it starts from the primary clock and propagates to the generated clocks

close the properties.
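
As a sketch of this step, assume the clk_gen block divides the primary clock by two in a flip-flop. The instance name clk_gen/clk_div_reg, the divide factor and the clock name clk_div2 are hypothetical and must be replaced by what the schematic actually shows.

```tcl
# -source is the pin where the master clock arrives (the flop clock pin);
# the generated clock itself is defined on the flop output pin.
create_generated_clock -name clk_div2 \
    -source [get_pins clk_gen/clk_div_reg/C] \
    -divide_by 2 \
    [get_pins clk_gen/clk_div_reg/Q]

# Confirm the new clock is listed, then inspect the paths it launches.
report_clocks
report_timing -from [get_clocks clk_div2] -setup
```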

 

set input_delay

report_timing -from [all_inputs]

see the reports

set_input_delay -clock [get_clocks phyclk0] -max 3 $my_input

report_timing -from $my_input -setup

see the reports

see the slack and input delay

insert the below tcl

set_input_delay -clock [get_clocks phyclk0] -min 1 $my_input

report_timing -from $my_input -hold

the report shows the actual slack and the input delay

In addition, set_input_delay can be added from the Edit Timing Constraints window
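
The -max and -min values normally come from the upstream device and the board. The sketch below shows one common way to derive them, ignoring board clock skew; all numbers and the port name data_in are hypothetical, chosen so that the results match the 3 ns and 1 ns used above.

```tcl
# Hypothetical board numbers, in ns.
set tco_max   2.5   ;# upstream device clock-to-out, worst case
set tco_min   0.8   ;# upstream device clock-to-out, best case
set trace_max 0.5   ;# board trace delay, worst case
set trace_min 0.2   ;# board trace delay, best case

set my_input [get_ports data_in]   ;# hypothetical port name

# max = latest arrival (checked for setup), min = earliest arrival (checked for hold)
set_input_delay -clock [get_clocks phyclk0] -max [expr {$tco_max + $trace_max}] $my_input
set_input_delay -clock [get_clocks phyclk0] -min [expr {$tco_min + $trace_min}] $my_input

report_timing -from $my_input -setup
report_timing -from $my_input -hold
```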

 

set output_delay

 

report_timing -to [all_outputs]

see the reports

set_output_delay -clock [get_clocks sysclk0] <delay> $my_output

report_timing -to $my_output -hold

see the reports

see the actual slack and output delay

In addition, set_output_delay can be added from the Edit Timing Constraints window
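
For outputs, the delay values describe what the downstream device needs. A minimal sketch, ignoring board clock skew; the setup, hold and trace numbers and the port name data_out are hypothetical.

```tcl
# Hypothetical downstream device and board numbers, in ns.
set tsu       1.5   ;# downstream setup requirement
set th        0.5   ;# downstream hold requirement
set trace_max 0.5   ;# board trace delay, worst case
set trace_min 0.4   ;# board trace delay, best case

set my_output [get_ports data_out]   ;# hypothetical port name

# max = trace_max + setup (used for setup analysis)
# min = trace_min - hold  (used for hold analysis, may be negative)
set_output_delay -clock [get_clocks sysclk0] -max [expr {$trace_max + $tsu}] $my_output
set_output_delay -clock [get_clocks sysclk0] -min [expr {$trace_min - $th}] $my_output

report_timing -to $my_output -setup
report_timing -to $my_output -hold
```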

 

set clock groups

Open the synthesized design

Report clock interaction

Analyze the timing paths from one clock domain to another clock domain

The report shows a grid of clock interactions

Each cell is colored to indicate the timing and constraint status of the inter-clock paths

If the paths between two clocks do not meet timing, check whether the clocks are in fact asynchronous

When the clock frequencies are not integer multiples of each other, it is impossible to determine a fixed phase relationship between them.

 

report_clock_interaction -delay_type min_max -significant_digits 3 -name timing_1

see the WNS and choose max

suppose you have two clocks, clk_out1_clk_core and clk_out2_clk_core =>

 

join [get_timing_paths -from [get_clocks clk_out1_clk_core] -to [get_clocks clk_out2_clk_core] -max_paths 200]

 

join [get_timing_paths -from [get_clocks clk_out2_clk_core] -to [get_clocks clk_out1_clk_core] -max_paths 200]

 

Mark the two clocks as asynchronous

Launch Edit Timing Constraints

Double-click Set Clock Groups

You need to specify two clock groups

Add the first clock group

Add the second by clicking the + sign

Note that asynchronous is chosen by default.

Save the constraints

Report clock interaction

Observe that the inter-clock path between clock group 1 and clock group 2 is now shown in blue, confirming that it is asynchronous
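
The GUI steps above correspond to a single constraint. A minimal sketch using the two clock names from this section; the group name core_async and the report name timing_2 are arbitrary.

```tcl
# Declare the two clocks asynchronous to each other. Paths between the
# groups are no longer timed, so the crossings need proper synchronizers.
set_clock_groups -asynchronous -name core_async \
    -group [get_clocks clk_out1_clk_core] \
    -group [get_clocks clk_out2_clk_core]

# Re-run the interaction report to confirm the new status.
report_clock_interaction -delay_type min_max -name timing_2
```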

 

Set false path


 

report_timing_summary

See the interclock path

See the setup

set_false_path -from [get_pins ...] -to [get_pins ...]

report_timing -from [get_pins ...] -to [get_pins ...]

if the slack is infinite, the false path was applied successfully

if not, check the false path constraint
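
A minimal sketch with concrete endpoints. The instance names cfg_reg and sync_ff0 are hypothetical; the start point is the clock pin of the launching register and the end point is the data pin of the capturing register.

```tcl
# Hypothetical endpoints of a path that does not need to be timed.
set from_pin [get_pins cfg_reg/C]
set to_pin   [get_pins sync_ff0/D]

set_false_path -from $from_pin -to $to_pin

# The same -from/-to pair is reused so the report covers exactly that path.
report_timing -from $from_pin -to $to_pin
```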

 

set multicycle path

report_timing

See the interclock path

See the setup

 

 

 

 -hold

 

 

 

 - hold

 

See the violated slack, if any

See the requirement

 

You can see the change in slack
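
A minimal sketch of a multicycle constraint, assuming a path whose data is allowed two clock cycles to arrive. The register names src_reg and dst_reg are hypothetical. The hold multiplier of N - 1 is the usual companion to a setup multiplier of N, so that the hold check stays on the launch edge.

```tcl
# Hypothetical endpoints of the slow path.
set from_pin [get_pins src_reg/C]
set to_pin   [get_pins dst_reg/D]

# Allow two clock cycles for setup ...
set_multicycle_path 2 -setup -from $from_pin -to $to_pin
# ... and move the hold check back by one cycle (N - 1).
set_multicycle_path 1 -hold -from $from_pin -to $to_pin

# Compare the requirement and slack before and after the constraint.
report_timing -from $from_pin -to $to_pin -setup
report_timing -from $from_pin -to $to_pin -hold
```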


Description:

 

  Generate a new clock object from an existing physical clock object in the

  design.

 

  Clocks can be added to a design in one of three ways:

 

   *  Primary physical or virtual clocks defined with the create_clock command.

 

   *  Derived clocks defined with the create_generated_clock command

      generated from a primary physical clock.

 

   *  Derived clocks automatically generated by the Vivado Design Suite when

      a clock propagates to an MMCM/PLL/BUFR.

 

  You can also use the create_generated_clock command to change the name of clocks that the Vivado tool has auto-derived from an MMCM/PLL/BUFR. In this case, a new clock is not created, but an existing clock defined on the specified source object is renamed to the provided name. This requires -name and <object> to be specified, and supports the use of -source and/or -master_clock to further identify the clock to rename when multiple clocks exist on the source object. Refer to the Vivado Design Suite User Guide: Using Constraints (UG903) for more information on renaming auto-derived clocks.

 

  Note: You cannot rename a clock that is already in use by other constraints at the time of renaming. You must rename the clock prior to any other appearance or use of the clock in an XDC file.

 

  This command returns the name of the clock object that is created, or returns an error if it fails.

 

Arguments:

  -name <arg> - (Optional) The name of the generated clock to create on the specified object, or the name to assign to an existing clock on the specified object. If no name is specified, the generated clock will be given the name of the <object> it is assigned to. If assigned to multiple <objects>, the name will be the first object in the list.

 

  -source <arg> - (Optional) The pin or port of the master clock from which to derive the generated clock. The master clock must be a previously defined physical clock, not a virtual clock; but can be a primary clock or another generated clock. If the source pin or port currently has multiple clocks defined, the -master_clock option must be used to identify which clock on the source is to be used to define the generated clock.

 

  -edges <arg> - (Optional) Specifies the edges of the master clock to use in defining transitions on the generated clock. Specify transitions on the generated clock in a sequence of 1, 2, 3, by referencing the appropriate edge count from the master clock in numerical order, counting from the first edge. The sequence of transitions on the generated clock defines the period and duty cycle of the clock: position 1 is the first rising edge of the generated clock, position 2 is the first falling edge of the generated clock and so defines the duty cycle, position 3 is the second rising edge of the generated clock and so defines the clock period. Enclose multiple edge numbers in braces {}. See the example below for specifying edge numbers.

 

  -divide_by <arg> - (Optional) Divide the frequency of the master clock by the specified value to establish the frequency of the generated clock object. The value specified must be >= 1, and must be specified as an integer.

 

  -multiply_by <arg> - (Optional) Multiply the frequency of the master clock by the specified value to establish the frequency of the generated clock object. The value specified must be >= 1, and must be specified as an integer.

 

  -combinational - (Optional) Define a combinational path to create a "-divide_by 1" generated clock.

 

  -duty_cycle <arg> - (Optional) The duty cycle of the generated clock defined as a percentage of the new clock period when used with the -multiply_by argument. The value is specified as a percentage from 0.0 to 100.

 

  -invert - (Optional) Create a generated clock with the phase inverted from the master clock.

 

  -edge_shift <arg> - (Optional) Shift the edges of the generated clock by the specified values relative to the master clock. See the example below for specifying edge shift.

 

  -add - (Optional) Add the generated clock object to an existing clock group specified by <objects>.

 

  Note: -master_clock and -name options must be specified with -add.

 

  -master_clock <arg> - (Optional) If there are multiple clocks found on the source pin or port, the specified clock object is the one to use as the master for the generated clock object.

 

  Note: -add and -name options must be specified with -master_clock.

 

  -quiet - (Optional) Execute the command quietly, returning no messages from the command. The command also returns TCL_OK regardless of any errors encountered during execution.

 

  Note: Any errors encountered on the command-line, while launching the command, will be returned. Only errors occurring inside the command will be trapped.

 

  -verbose - (Optional) Temporarily override any message limits and return all messages from this command.

 

  Note: Message limits can be defined with the set_msg_config command.

 

  <objects> - (Required) The pin or port objects to which the generated clock should be assigned. If the specified objects already have a clock defined, use the -add option to add the new generated clock and not overwrite any existing clocks on the object.

 

Examples:

 

  The following example defines a generated clock that is divided from the master clock found on the specified CLK pin. Since -name is not specified, the generated clock is assigned the same name as the pin it is assigned to:

 

    create_generated_clock -divide_by 2 -source [get_pins clkgen/sysClk] fftEngine/clk

 

  The following example defines a generated clock named CLK1 from the specified source clock, specifying the edges of the master clock to use as transition points for the generated clock, with edges shifted by the specified amount. In this example, the -edges option indicates that the second edge of the source clock is the first rising edge of the generated clock, the third edge of the source clock is the first falling edge of the generated clock, and the eighth edge of the source clock is the second rising edge of the generated clock. These values determine the period of the generated clock as the time from edge 2 to edge 8 of the source clock, and the duty cycle as the percentage of the period between edge 2 and edge 3 of the source clock. In addition, each edge of the generated clock is shifted by the specified amount:

 

    create_generated_clock -name CLK1 -source CMB/CLKIN -edges {2 3 8} \
    -edge_shift {0 -1.0 -2.0} CMB/CLKOUT

 

  Note: The waveform pattern of the generated clock is repeated based on the transitions defined by the -edges argument.
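The arithmetic behind -edges and -edge_shift can be checked with a few lines of Python. This is only a sketch: the 10 ns master period and the 50% master duty cycle are assumptions (the CLK1 example does not state them), and the function name generated_waveform is made up for illustration.

```python
def generated_waveform(master_period, edges, edge_shift=(0.0, 0.0, 0.0)):
    """Return (rise, fall, next_rise) times in ns for a generated clock.

    Assumes the master clock rises at t = 0 with a 50% duty cycle, so
    master edge n (1-based) occurs at (n - 1) * master_period / 2.
    """
    half = master_period / 2.0
    return tuple((n - 1) * half + s for n, s in zip(edges, edge_shift))


# Assumed 10 ns master clock; edges and shifts taken from the CLK1 example.
rise, fall, next_rise = generated_waveform(10.0, (2, 3, 8), (0.0, -1.0, -2.0))
period = next_rise - rise      # time between consecutive rising edges
high_time = fall - rise        # time spent high in each period
duty_cycle = 100.0 * high_time / period

print(f"rise={rise} ns, fall={fall} ns, period={period} ns, duty={duty_cycle:.1f}%")
```

With these assumed numbers the shifted edges land at 5 ns, 9 ns and 33 ns, so the generated clock has a 28 ns period and is high for 4 ns of it.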

 

  This example creates two generated clocks from the output of a MUX, using -master_clock to identify which clock to use, using -add to assign the generated clocks to the Q pin of a flip flop, and using -name to define a name for the generated clock, since the object it is assigned to has multiple clocks assigned:

 

    create_generated_clock -source [get_pins muxOut] -master_clock M_CLKA \
    -divide_by 2 -add -name gen_CLKA [get_pins flop_Q]

    create_generated_clock -source [get_pins muxOut] -master_clock M_CLKB \
    -divide_by 2 -add -name gen_CLKB [get_pins flop_Q]

 

  The following example renames the automatically named clock that is derived by the Vivado Design Suite on the MMCM clock output:

 

    create_generated_clock -name CLK_DIV2 [get_pins mmcm/CLKOUT1]

 

See Also:

 

   *  check_timing

   *  create_clock

   *  get_generated_clocks

   *  get_pins

   *  report_clocks

   *  set_clock_latency

   *  set_clock_uncertainty

   *  set_propagated_clock

 

Generated clocks are driven inside the design by special cells called Clock Modifying Blocks (for example, an MMCM), or by some user logic.

The XDC command "create_generated_clock" is used to create a generated clock object.

Syntax:

create_generated_clock  [-name <arg>] [-source <args>] [-edges <args>]
                        [-divide_by <arg>] [-multiply_by <arg>]
                        [-combinational] [-duty_cycle <arg>] [-invert]
                        [-edge_shift <args>] [-add] [-master_clock <arg>]
                        [-quiet] [-verbose] <objects>

 

This article discusses the common use cases of creating a generated clock.

For more information on create_generated_clock, please refer to (UG903).

Solution

Generated clocks are associated with a master clock from which they are derived.

The master clock can be a primary clock or another generated clock.

Please ensure you define all primary clocks first.

They are required for defining the generated clocks.

 

Use Case 1: Automatically Derived Clocks

 

For Clock Modifying Blocks (CMB) such as MMCMx, PLLx, IBUFDS_GTE2, BUFR and PHASER_x primitives, you do not need to manually create the generated clocks.

Vivado automatically creates these clocks, provided the associated master clock has already been defined.

You only need to create the primary clock that is feeding into the CMB.

The auto-generated clock names can be reported by the report_clocks command in the synthesized or implemented design so that you can use them in other commands or constraints.

 

It is possible to force the name of the generated clock that is automatically created by the tool.

See "Use Case 2: Renaming Auto-derived Clocks" below.

 

An auto-generated clock is not created if a user-defined clock (primary or generated) is also defined on the same netlist object, that is, on the same definition point (net or pin).

Vivado gives the following warning message when an existing primary or generated clock prevents auto-generated clock propagation:

 

Warning: [Timing 38-3] User defined clock exists on pin <pin_name> and will prevent any subsequent automatic derivation.

 

Automatically Derived Clock Example



The following automatically derived clock example is a clock generated by an MMCM.



XDC constraint:

create_clock -name clkin -period 10.000 [get_ports clkin]

 

The report_clocks command prints the following information:

 

Clock    Period    Waveform           Attributes  Sources
clkin    10.00000  {0.00000 5.00000}  P           {clkin}
cpuClk   10.00000  {0.00000 5.00000}  P,G         {clkip/mmcm0/CLKOUT}

Use Case 2: Renaming Auto-derived Clocks

It is possible to force the name of the generated clock that is automatically created by the tool.

The renaming process consists of calling the create_generated_clock command with a limited number of parameters.

 

create_generated_clock -name new_name [-source source_pin] [-master_clock master_clk] source_object

              

A single create_generated_clock command has to specify a unique auto-derived clock to rename.

A user-defined generated clock cannot be renamed.

 

 

Renaming Auto-derived Clock Example

 

This uses the same example as in Use Case 1:

XDC constraint:

create_clock -name clkin -period 10.000 [get_ports clkin]
#renaming auto-derived clock 
create_generated_clock -name user_clk [get_pins clkip/mmcm0/CLKOUT]

Then the report_clocks command prints the following information:

Clock     Period    Waveform           Attributes  Sources
clkin     10.00000  {0.00000 5.00000}  P           {clkin}
user_clk  10.00000  {0.00000 5.00000}  P,G         {clkip/mmcm0/CLKOUT}

......

Use Case 3: User Defined Generated Clocks

When no automatic generation occurs, you need to define the generated clock manually.

For example, when a clock divider is built from LUTs and FFs, Vivado is not aware of the period relationship between the source clock and the divided clock.

As a result, a user-defined generated clock is required for the divided clock.

This type of clock divider is not recommended in an FPGA. We recommend using an MMCM or a PLL to divide the clock.

Specify the master source using the -source option.
This indicates a pin or port in the design through which the master clock propagates.
It is common to use the master clock source point or the input clock pin of a generated clock source cell.

 

User Defined Generated Clock Example

The primary clock drives a register divider to create a divide-by-2 clock at the register output.

https://www.xilinx.com/Image/register_divider_201410200139017013.png

 

Two equivalent constraints are provided below:

create_clock -name clkin -period 10 [get_ports clkin]

# Option 1: master clock source is the primary clock source point with a 'divide by' value of the circuit.

create_generated_clock -name clkdiv2 -source [get_ports clkin] -divide_by 2 [get_pins REGA/Q]

# Option 2: master clock source is the REGA clock pin with a 'divide by' value of the circuit.

create_generated_clock -name clkdiv2 -source [get_pins REGA/C] -divide_by 2 [get_pins REGA/Q]

 

Use Case 4: Forwarded Clock through ODDR

 

In the Source Synchronous application, the clock is regenerated in the source device and forwarded to the destination device along with data.

A common method is to use clock forwarding via a double data-rate register.

In the following example, the ODDR instance in the source device is used to generate the forwarding clock for the Source Synchronous interface.

A user-defined generated clock needs to be created for the forwarding clock in order to be used in the set_output_delay constraint for the Source Synchronous interface.

Example of Creating Generated Clock at Clock Output Port:


 

create_generated_clock -name fwd_clk -multiply_by 1 -source [get_pins ODDR_inst/C] [get_ports CLKOUT]


The generated clock can then be referenced in the set_output_delay command.

For more information on set_output_delay command, please refer to (UG903).

 

Use Case 5: Overlapping Clocks Driven by a Clock Multiplexer

When two or more clocks drive into a multiplexer (or more generally a combinatorial cell), they all propagate through and become overlapped on the fanout of the cell.

For this reason, you must review the CDC paths and add new constraints to exclude false paths due to overlapping.

The correct constraints are dictated by how and where the clocks interact in the design.

In some scenarios, user-defined generated clocks need to be created for the multiplexed clock in order to correctly constrain the CDC paths.

 

Multiplexed Clock Example:


 

 

If clk0 and clk1 only interact in the fanout of the multiplexer (FDM0 and FDM1), (i.e. the paths A, B and C do not exist), it is safe to apply the clock groups constraint to clk0 and clk1 directly.

set_clock_groups -logically_exclusive -group clk0 -group clk1


If clk0 and/or clk1 directly interact with the multiplexed clock (i.e. the paths A or B or C exist), then in order to keep timing for paths A, B and C, the constraint cannot be applied to clk0 and clk1 directly.


Instead, it must be applied to the portion of the clocks in the fanout of the multiplexer, which requires additional clock definitions.

In this case, two generated clocks are created at the Multiplexer output pin and paths crossing the generated clock domains are ignored.

create_generated_clock -name clk0mux -divide_by 1 -source [get_pins mux/I0] [get_pins mux/O]

create_generated_clock -name clk1mux -divide_by 1 -add -master_clock clk1 -source [get_pins mux/I1] [get_pins mux/O]

set_clock_groups -physically_exclusive -group clk0mux -group clk1mux

I have a design consisting of a Clocking Wizard IP (MMCM) with a 100 MHz input clock at "clk_in1". I generate a 50 MHz clock at the output "clk_out1".
As per UG903 (pages 88 and 89), Vivado automatically derives "create_generated_clock" constraints for the clocks generated by a PLL, MMCM, etc.


But when I checked my design, I could not see any "create_generated_clock" constraint defined automatically for "clk_out1".

You only need to create_clock for the input port of MMCM, then the output clock of MMCM will be automatically generated. You don't need to create_generated_clock on the output of MMCM manually.

You can check the result of report_clocks to see the auto-generated clocks.

Ex: clk_pin_p is the input clock for MMCM, clk_rx_clk_core/clk_tx_clk_core is the output of MMCM. 

  The following example shows the use of multiple through points to define both a specific path (through state_reg1) and alternate paths (through count_3 or count_4), and writes the timing results to the specified file:

 

    report_timing -from go -through {state_reg1} -through { count_3  count_4 } \
    -to done -path_type summary -file C:/Data/timing1.txt

 

report_timing -from [get_pins clk_in_IBUF_BUFG_inst/ clk_in_IBUF_inst] -to [get_pins freq_cnt_reg[24]/R]

 

 In a 2-stage pipeline, you break a task into two sub-tasks and execute them in a pipeline. Let's say each stage takes 1 cycle to complete.

That means in a 2-stage pipeline, each task takes 2 cycles to complete (known as latency). Once the pipeline is full, two tasks are in flight at any time and one task completes every cycle (known as throughput).


In a 3-stage pipeline, you break a task into three sub-tasks and execute them in a pipeline.


Assuming again that each stage takes 1 cycle, the latency of a 3-stage pipeline is higher (each task takes 3 cycles to complete), while three tasks are in flight at once and one task still completes every cycle. The throughput gain of a deeper pipeline comes from each stage doing less work, which allows a shorter clock cycle.

The number of stages is chosen to get the maximum throughput at an acceptable latency.
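To make the latency and throughput numbers concrete, here is a small Python sketch of an ideal pipeline. It assumes every stage takes exactly one cycle, a new task enters every cycle, and nothing ever stalls; the function name pipeline_timeline is made up for illustration.

```python
def pipeline_timeline(num_tasks, num_stages):
    """Return the cycle (1-based) in which each task leaves the last stage.

    Assumes every stage takes exactly one cycle and a new task enters
    the pipeline every cycle (no stalls).
    """
    return [task + num_stages for task in range(num_tasks)]


for stages in (2, 3):
    done = pipeline_timeline(num_tasks=5, num_stages=stages)
    latency = done[0]                               # cycles for the first task
    gaps = [b - a for a, b in zip(done, done[1:])]  # cycles between results
    print(f"{stages}-stage: latency={latency} cycles, "
          f"completion cycles={done}, cycles between results={gaps}")
```

Under these assumptions the deeper pipeline only delays the first result. After that, both pipelines deliver one result per cycle.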

https://www.hardwaresecrets.com/inside-pentium-4-architecture/2/



Data movement

 

 Data movement remains a challenge for all system designs, so truly remarkable results require careful attention to communication with memory and to the interconnect between processors and FPGAs. Given an interconnect with asymmetric transmission costs, data movement energy can be reduced by encoding the transmitted data so that the number of 1s in each transmitted codeword is minimized. To perform any computation, data must be communicated between the point in time and space where each intermediate data item is computed and the point where it is consumed. This communication can occur either through interconnect wires, if the operators are located at different places, or through memories, if the operators are sequentialized on a common physical operator.


 

When the embedded memory perfectly matches the size and organization needed by the application, an FPGA embedded memory can be as energy efficient as the same memory in a custom ASIC. Another point of mismatch between architecture and application is the width of the data written to or read from the memory block, because memory energy scales with the data width. As data grows in volume, velocity, complexity and variability, data analytics applications increasingly require analysis results in real time.

 

The Catapult cloud field-programmable gate array architecture accelerates both cloud services and the Azure cloud platform; Azure is an open, flexible, enterprise-grade cloud computing platform.

Project Catapult employs an elastic architecture that links FPGAs together in a 6x8 ring network that provides 20Gbps of peak bidirectional bandwidth at sub-microsecond latency. Microsoft is putting FPGAs on PCI Express networking cards in every new server deployed in its data centers. FPGAs handle compression, encryption, packet inspection and other rapidly changing tasks for data center networks that in six years have jumped from Gbps to 50Gbps data rates. They can score, filter, rank and measure the relevancy of text and image queries on Bing. Microsoft Azure and Bing servers are being deployed using the Catapult system design.

 

The system consists of Intel Xeon processor-based nodes, where each node offers 64 GB of memory and an Intel/Altera Stratix V D5 FPGA with 8 GB of local DDR3 memory. The I/O engine on the FPGA transfers data of a specified length to/from contiguous regions in the communication buffer. The buffer is split into two parts. A compute request carries the source and destination buffer addresses and the amount of data to be transferred to/from these buffers. As the computation for successive tiles is overlapped in the FPGA design, the overhead associated with making the compute request is hidden when a large enough number of tiles are processed successively. The FPGA reads from the source buffer, writes the result tiles back to the destination buffer, and sets the completion flag.

 

SDAccel is a framework for developing and delivering FPGA-accelerated data center applications. The environment includes a familiar software development flow with an Eclipse-based IDE and an architecturally optimizing compiler that makes efficient use of FPGA resources. Developers of accelerated applications use a familiar software programming workflow. The SDAccel compiler exposes the capabilities of the FPGA fabric through scheduling, pipelining, and dataflow.

 

FPGAs can be dynamically reconfigured. Much like loading a compiled program into a processor, reconfiguring the FPGA at run time re-purposes its resources to implement additional kernels as the accelerated application runs. The host application is built with GCC: each host application source file is compiled to an object file (.o), and the object files are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe). Each kernel is independently compiled to a Xilinx object (.xo) file. RTL kernels are compiled using the package_xo utility.

 The kernel files are linked with the hardware platform to create the FPGA binary (.xclbin). For the emulation targets, xocc generates simulation models of the device contents. The build target defines the nature of the FPGA binary. SDAccel provides three build targets: two emulation targets used for debug and validation, and the default hardware target used to generate the actual FPGA binary.

 

In hardware emulation, the kernel code is compiled into a hardware model which is run in a dedicated simulator. This build and run loop takes longer, but the target is useful for testing the functionality of the logic that will go into the FPGA and for getting initial performance estimates. In the hardware target, the kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that runs on the actual FPGA. In the host application, set up the environment to identify the OpenCL platform and the device IDs, specify a context, create a command queue, build a program, and spawn one or more kernels.

 

The issues facing systems today are the impact of moving data over long distances to CPUs and the inherent difficulty of optimizing the performance and power efficiency of data processing. FPGAs provide inherently flexible acceleration and offload capabilities. Integrating an FPGA with high memory capacity minimizes data movement by bringing processing resources to the data. The SDA engine is a flexible environment that allows engineers to experiment with near-data processing while exploring the interaction between application software, drivers, firmware, FPGA bitfiles and memory.

 

Energy Efficient Data Encoding Techniques

 

Differential encoding

 

Differential encoding is an example of algebraic encoding: it transmits the bit-wise XOR between successive words. In many cases, the bus values show high temporal correlation. Due to this, the Hamming distance (HD) between successive values is either very small or very large.
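A minimal Python sketch of differential encoding follows. The 8-bit sample values are invented to show temporal correlation, and sending the first word unchanged (XOR with an assumed initial bus state of 0) is one possible convention rather than something the text above prescribes.

```python
def diff_encode(words):
    """Transmit the XOR of each word with the previous one (first word as-is)."""
    prev, out = 0, []
    for w in words:
        out.append(w ^ prev)
        prev = w
    return out


def diff_decode(codewords):
    """Recover the original words by accumulating the XORs."""
    prev, out = 0, []
    for c in codewords:
        prev ^= c
        out.append(prev)
    return out


def ones(words):
    """Total number of 1 bits across all words."""
    return sum(bin(w).count("1") for w in words)


# Hypothetical, temporally correlated 8-bit bus values.
data = [0b10110100, 0b10110101, 0b10110111, 0b10110110]
encoded = diff_encode(data)
print([f"{c:08b}" for c in encoded], ones(data), ones(encoded))
```

Because consecutive words differ in only one or two bits, every codeword after the first carries very few 1s, which is what makes the scheme attractive when 1s are the expensive symbol.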

 

Sparse encoding and limited-weight codes

 

Sparse encoding schemes sacrifice bandwidth to reduce the number of 1s. A K-limited-weight code (K-LWC) is an example of a sparse code: it is a set of codewords (CWs) that each have a weight of at most K.

 

The limitation of LWCs is that their encoders and decoders have high logic complexity, so they are not suitable for on-chip interconnects.
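As a rough illustration of a limited-weight code, the Python sketch below builds a lookup table that maps every data value to a codeword of weight at most K. The parameters (3 data bits carried on 7 wires, K = 1) and the function name are assumptions chosen for the example. A table lookup also sidesteps the encoder complexity mentioned above, so this is not a model of a hardware encoder.

```python
from itertools import combinations


def limited_weight_code(data_bits, code_bits, k):
    """Map every data value to a code_bits-wide codeword of weight <= k.

    Codewords are assigned in order of increasing weight; raises
    ValueError if too few codewords of weight <= k exist.
    """
    codewords = []
    for weight in range(k + 1):
        for positions in combinations(range(code_bits), weight):
            codewords.append(sum(1 << p for p in positions))
    if len(codewords) < 2 ** data_bits:
        raise ValueError("not enough codewords of weight <= k")
    return dict(enumerate(codewords[: 2 ** data_bits]))


# Illustrative parameters: 3 data bits sent over 7 wires with weight <= 1
# (1 all-zero codeword + 7 one-hot codewords = 8 codewords).
code = limited_weight_code(data_bits=3, code_bits=7, k=1)
for value, cw in code.items():
    print(f"{value:03b} -> {cw:07b}")
```

The bandwidth cost is visible in the parameters: seven wires carry three bits of information so that no transfer ever contains more than a single 1.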

 

 

Redundancy in encoding techniques

 

The encoding techniques that do not use any redundancy or metadata are termed as irredundant encoding techniques. By comparison, several encoding techniques use redundancy in either space or time. Space redundancy implies use of additional bus lines and time redundancy refers to use of additional transfer cycles.

 

Techniques based on value similarity

 

The first technique works on the observation that in GPU applications, nearly twenty-two out of thirty-two bits in a word are 0s and a data word has on average nine leading 0s. The technique flips all the bits of positive data values only. This is realized by XNORing the leading “sign bit” with all the other bits.
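The sign-bit XNOR can be sketched in Python as follows. The 32-bit width follows the text, but leaving the sign bit itself untouched and the name sign_xnor are my reading of the description, not a confirmed detail of the original technique.

```python
WIDTH = 32
LOW_MASK = (1 << (WIDTH - 1)) - 1   # all bits except the sign bit


def sign_xnor(word):
    """XNOR the sign bit with every other bit; the sign bit itself is kept.

    Positive words (sign bit 0) have their lower 31 bits inverted, so
    leading 0s become 1s; negative words pass through unchanged.
    Applying the function twice returns the original word.
    """
    sign = (word >> (WIDTH - 1)) & 1
    replicated = LOW_MASK if sign else 0            # sign copied to low bits
    low = ~((word & LOW_MASK) ^ replicated) & LOW_MASK   # bitwise XNOR
    return (sign << (WIDTH - 1)) | low


small_positive = 5                       # many leading 0s
small_negative = 0xFFFFFFFB              # -5 in two's complement
print(f"{sign_xnor(small_positive):032b}")
print(f"{sign_xnor(small_negative):032b}")
```

Since the sign bit is preserved, the receiver can undo the transformation by applying the same function again.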

 

The second technique is also based on value similarity, but uses Hamming distance (HD) instead of arithmetic distance. Here, one lane of a warp is chosen as a pivot and the remaining lanes are termed non-pivot. Then, for every block of data accessed by a warp, all the “non-pivot lanes” perform XNOR with the “pivot lane”. Thus, the bits of those lanes that match the “pivot lane” are changed to 1. As for the choice of the “pivot lane”, the authors note that both the arithmetic and Hamming distances of lane-0 from other lanes are generally larger than those of the middle-positioned lanes. They choose lane-20 as the pivot lane since it shows the smallest average HD from other lanes.
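A small Python sketch of the pivot-lane idea is shown below. To keep the example short it uses a made-up 4-lane "warp" with pivot lane 2 instead of a real 32-lane warp, and it assumes the pivot lane itself is transmitted unmodified so the receiver can decode.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def xnor(a, b):
    """Bitwise XNOR on WIDTH-bit words: matching bits become 1."""
    return ~(a ^ b) & MASK


def pivot_encode(lanes, pivot):
    """XNOR every non-pivot lane with the pivot lane; the pivot is sent as-is."""
    p = lanes[pivot]
    return [w if i == pivot else xnor(w, p) for i, w in enumerate(lanes)]


def pivot_decode(coded, pivot):
    """XNOR with the same pivot value undoes the encoding."""
    return pivot_encode(coded, pivot)


# Hypothetical 4-lane warp holding similar values (e.g. neighbouring addresses).
lanes = [0x1000, 0x1004, 0x1008, 0x100C]
coded = pivot_encode(lanes, pivot=2)
print([f"{c:08X}" for c in coded])
```

Lanes whose values are close to the pivot in Hamming distance turn into words that are almost all 1s.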

 

The third technique works on the observation that in a 64-bit instruction, some positions favor a bit value, e.g., the 25th position favors ‘0’. Based on this preference information, an “average mask” is generated at compile time such that the mask bit is set to 0 at bit positions favoring 0 and to 1 at the remaining positions. This mask is XNORed with all the instructions to increase the frequency of 1s.
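The mask construction can be sketched in Python like this. The three 64-bit "instructions" are invented values, and deciding a position's preference by a simple majority vote (ties counted as favoring 1) is an assumption about how the preference information is obtained.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def build_average_mask(instructions):
    """Set mask bit i to 1 if position i holds a 1 in at least half of the
    sample instructions (position favors 1), else to 0 (position favors 0)."""
    mask = 0
    for bit in range(WIDTH):
        count = sum((ins >> bit) & 1 for ins in instructions)
        if 2 * count >= len(instructions):
            mask |= 1 << bit
    return mask


def apply_mask(instruction, mask):
    """XNOR with the mask: bits at 0-favoring positions are inverted."""
    return ~(instruction ^ mask) & MASK


def ones(words):
    """Total number of 1 bits across all words."""
    return sum(bin(w).count("1") for w in words)


# Hypothetical instruction words, not real GPU encodings.
sample = [0x00000000FFFF0001, 0x00000000FFFF0003, 0x00000000FFFE0001]
mask = build_average_mask(sample)
masked = [apply_mask(ins, mask) for ins in sample]
print(f"mask={mask:016X}", ones(sample), ones(masked))
```

Applying the same mask a second time restores the original instruction, so the decoder needs nothing beyond the mask.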

 

 

► Saving processor cycles by offloading the computation

 

► High performance of the PL-based accelerator itself

 

► Lower latency

 

► Higher throughput

 

► Several times faster compared to software-based computation

 

► Ensure that data transfer delays between PS and accelerator do not eliminate the performance gain from the accelerator

 

 

Techniques based on value predictors

 

 

Value predictors (VPs) can be used to reduce bit transitions (BTs) on on-chip buses. The technique runs the same VP on both the sender and the receiver. Since both VPs work synchronously and are based on identical values, their predictions are identical. The prediction of the VP at the sender side is compared with the actual value to be sent over the bus. In case of a match, nothing is transmitted over the bus, and the value is obtained from the VP at the receiver. In case of a mismatch, the original value itself is transmitted along with a redundant control bit.
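The sender/receiver protocol can be sketched in Python with the simplest possible predictor. The last-value predictor and the (control_bit, payload) frame format are assumptions for illustration; the text above does not say which predictor is used or how the control bit is encoded.

```python
class LastValuePredictor:
    """Toy value predictor: predicts that the next value equals the last one."""

    def __init__(self):
        self.last = 0

    def predict(self):
        return self.last

    def update(self, value):
        self.last = value


def send(values):
    """Sender side: emit (control_bit, payload) pairs.

    control_bit 0 = prediction hit, nothing is driven on the data lines;
    control_bit 1 = mismatch, the original value is transmitted.
    """
    vp, frames = LastValuePredictor(), []
    for v in values:
        hit = vp.predict() == v
        frames.append((0, None) if hit else (1, v))
        vp.update(v)
    return frames


def receive(frames):
    """Receiver side: an identical predictor reconstructs the stream."""
    vp, out = LastValuePredictor(), []
    for control, payload in frames:
        v = vp.predict() if control == 0 else payload
        out.append(v)
        vp.update(v)
    return out


stream = [7, 7, 7, 9, 9, 7]
frames = send(stream)
print(frames, receive(frames))
```

Both predictors see exactly the same sequence of values, which is what keeps them in lockstep without any extra synchronization traffic.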

 

As the accuracy of the VPs increases, the technique saves more energy. It can be improved further by using prediction confidence. Assume that the set of possible predicted values is sorted by confidence, and let the bus width be W bits. First, the value with the largest confidence is mapped to the codeword (CW) with the least energy overhead. For the transition signaling (TS) scheme, the CW with the least energy overhead is the all-zero symbol, which leads to zero BTs. Then, the next W values are mapped to symbols with only a single bit set, i.e., having an HD of one. After this, symbols of higher HD are used. When a new input word is sent to the predictor, it checks the existing encoded values. In case of a hit, the corresponding CW is sent; otherwise, either the original data or the inverted original data is sent.

 

Techniques based on storing frequent values in a table

 

Frequent value (FV) encoding techniques work by saving frequent values in tables at both sender and receiver, and transmitting only the index of the value in the table, which reduces the number of BTs significantly.
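A minimal Python sketch of frequent-value encoding is given below. The table contents are made up, the table is static, and sending a miss as a raw value tagged ("raw", value) is an assumption; the text above only describes the hit case.

```python
def fv_encode(values, table):
    """Replace table hits by ("idx", index); send misses as ("raw", value)."""
    index_of = {v: i for i, v in enumerate(table)}
    return [("idx", index_of[v]) if v in index_of else ("raw", v) for v in values]


def fv_decode(frames, table):
    """Receiver holds an identical table and looks the indices up."""
    return [table[x] if kind == "idx" else x for kind, x in frames]


# Hypothetical table of frequent 32-bit values, shared by both ends.
table = [0x00000000, 0xFFFFFFFF, 0x00000001]
values = [0x00000000, 0xFFFFFFFF, 0x00001234, 0x00000001, 0x00000000]
frames = fv_encode(values, table)
print(frames)
```

With a three-entry table an index needs only two wires to toggle instead of thirty-two, which is where the reduction in bit transitions comes from.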

 

Energy Consumption of Data Movement

 

Recent DDR3/4, GDDR4/5, and LPDDR4 interfaces all support on-die termination (ODT) in different forms. With a terminated interface, the IO interface consumes energy when transmitting a 0, as current flows from VDD to GND; transmitting a 1 is effectively free. This asymmetric energy cost provides the opportunity for coding techniques to reduce the energy consumption of DDR4 and GDDR5 interfaces by reducing the number of transmitted 0s.

 

The energy consumption of the unterminated LPDDR3 IO interface is caused by charging and discharging the load capacitance of the data bus. Unlike the asymmetric energy consumption of a terminated interface, the energy consumption of the LPDDR3 IO interface is context-dependent, and is proportional to the number of 0 -> 1 and 1 -> 0 transitions on the data bus. The Hybrid Memory Cube (HMC) employs SerDes links in its IO interface to provide high bandwidth and energy efficiency. However, the high static power of the interface prevents energy-efficient data encoding techniques from effectively reducing data movement energy. Photonic interconnects require significant hardware changes and also dissipate high static power.
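The two cost models described above (terminated and unterminated interfaces) can be compared with a short Python sketch. The costs are counted in abstract units of "one 0 sent" or "one wire toggled" rather than in joules, and the assumed initial bus state of 0 is an arbitrary choice.

```python
def terminated_cost(words, width):
    """Terminated (ODT) interface: cost is proportional to transmitted 0s."""
    return sum(width - bin(w).count("1") for w in words)


def unterminated_cost(words, initial=0):
    """Unterminated interface: cost is proportional to bit transitions
    (0 -> 1 or 1 -> 0) between consecutive words on the bus."""
    cost, prev = 0, initial
    for w in words:
        cost += bin(w ^ prev).count("1")
        prev = w
    return cost


# Hypothetical 8-bit transfers.
steady_ones = [0xFF, 0xFF, 0xFF, 0xFF]
alternating = [0xF0, 0x0F]
for name, words in (("steady_ones", steady_ones), ("alternating", alternating)):
    print(name, terminated_cost(words, 8), unterminated_cost(words))
```

A stream of all-1 words is free on the terminated model but still pays for the first transition on the unterminated one, while alternating patterns are costly in both. This is why a code that suits one interface does not automatically suit the other.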

 


FPGA for defense applications

 

  

The semiconductor industry has played a large role in the strides radar has made over the last two decades. In today's modern radar systems, the Active Electronically Scanned Array (AESA) is the most popular architecture. An FPGA is an ideal, and in some cases necessary, solution for addressing the challenges of these systems.

VPX is an ANSI standard that builds on the VME bus and addresses the shortcomings in scalability and performance of bus-to-bus bridging technology. VPX provides support for serial switch fabrics over a new high-speed connector and is designed to operate in harsh environments. VPX was largely focused on the board level, and industry leaders saw a need to develop a system-level standard to improve interoperability and reduce customization, testing, cost, and risk.

The VPX form factor is used today for a wide range of applications and deployed environments. Applications range from rail management systems to active electronically scanned array radars on high-performance fighter aircraft, and environments vary from lab-benign to tracked vehicles and unpressurized bays in high-altitude unmanned aerial vehicles. The ruggedness and huge bandwidth provided by VPX make it a natural choice for many projects.

Next-generation radar architectures such as digital phased array and synthetic aperture radar with ground moving target indicator are the emerging technologies. High-performance data processing, ultra-wide bandwidth, high dynamic range, and adaptive systems needed for diverse mission requirements are some of the most common challenges system designers face in achieving this. Using floating-point technology with the Stratix FPGA series and variable-precision digital signal processing (DSP) allows the designer to define the needed precision for each stage of the design, so that logic and DSP resources are used efficiently while reducing power consumption.

The FrontEnd Cluster is composed of N FPGA resources and a Control Processor. Data paths in and out of the FPGAs are implemented using serial interconnects routed over a VPX backplane. The FrontEnd Cluster is responsible for the initial data acquisition and preprocessing of the raw input data. The BackEnd Cluster comprises a graphics card with an NVIDIA or ATI GPGPU. The processor in slot one of the BackEnd Cluster can be a Core 2 Duo, a dual-core Core i5, or a quad-core Core i7. The BackEnd Cluster may be connected to the FrontEnd Cluster over the control plane by Ethernet for post processing, or it can be connected over the data plane via PCIe, through shared memory for instance.

 

 

The Control Processor is associated with the FPGA elements; it can be used to manage the FPGA cards and provides a multicore processor resource. The Control Processor is connected via a PCIe switch to other system elements. Very often the system application data flows will involve multiple data streams that are assigned to different FPGA resources. In a radar application, the radar array antenna may be divided into regions, each associated with a sensor area to be processed by a specific FPGA resource. A switch is used to connect the SBC to the FPGAs. Once data is received and processed by the FrontEnd Cluster, it can be passed down to another resource for further processing. Post processing involves final computations and processes, or potentially a display process. The two processing clusters are connected together via a hybrid switch, consisting of a PCIe Gen 2 switch fabric on the data plane and a GbE switch fabric on the control plane.

 

 


The switch allows for transparent and nontransparent connection of the two domains. The switch provides 6 ports, or 24 PCIe lanes as well as transparent or non transparent switching. The switch can support multiple Root Complexes. In addition to PCIe fabric for Data Plane implementation, the switch supports a gigabit Ethernet for implementation of the Control Plane via 1000BT or 1000BX Ethernet. The switch provides for rear expansion Ethernet ports, and provides two copper or fiber ports out the front. It is fully managed allowing Layer Two, and Layer Three functionality.

 

The modular architecture enables users to add high performance FPGA and I/O to the base configuration, addressing many application requirements such as digital RF memory (DRFM), synchronous multi-channel MIMO systems, software defined radio (SDR) and more. The VPX370 was designed with flexibility in mind: both in the ability to scale from lab to field and in application versatility. All 4DSP FPGA, I/O technology, and backplanes used in the VPX370 can be configured for rugged conduction cooled form factors, making the system an ideal platform for developing IP and technology with an easy migration path to a deployed rugged system. The blade is suited to on-platform cognitive electronic warfare (EW), next-generation radar, machine learning, and AI applications that require small, powerful and scalable processing engines. Each blade combines an Intel Xeon D server-class processor, a Xilinx UltraScale field-programmable gate array (FPGA), and a mezza.

The FEP board is designed to handle the upfront processing of incoming high definition video, electro-optical, infrared, radar and other extremely high data rate signals. When used in conjunction with a dual cluster system, single or multiple TIC-FEP-VPX3b boards form the basis of a highly efficient front end processing system with configurable signal input options via a selection of FMC modules.


One DSP/FPGA-based application that benefits from 3U VPX is signal intelligence, which is often deployed on small platforms with low-power constraints. Ground vehicle sensors, software radio, and vetronics also benefit from 3U VPX. Military ground vehicles, while large, often have very limited available space for electronics. Integrators also want to take advantage of higher-performance systems by fitting new capabilities into a fixed space. Similarly, UAVs typically have severe SWaP limitations that can make 3U VPX a perfect fit, especially for smaller UAVs. One additional application area where we’ve seen a lot of interest for 3U VPX is electronic warfare and countermeasure systems. These systems are often deployed in pods or located in out-of-the-way places on an aircraft, such as on wing roots, where there simply is not a lot of space on the airframe.

 

OpenVPX and associated standards define interfaces between Plug-In Modules and chassis for products intended to be deployed in harsh environments

Changes to support data acquisition and RF subsystems:

– Radial clocking for high-precision clocking of A/Ds and D/As

– Blind mate backplane optical and coax connectors to support 2-level maintenance

OpenVPX builds on VPX to add system thinking:

– VPX (VITA 46) has dot specifications for each protocol and some others: VME, RapidIO, PCIe, Ethernet

– With OpenVPX there are profiles which make use of multiple protocols

– OpenVPX profiles spell out how multiple VPX dot specifications are to be used together

OpenVPX Modules are frequently adapted for particular I/O needs using mezzanine cards. XMCs are used on processing modules or carrier cards to add I/O and other features:

• PMC – PCI Mezzanine Cards – IEEE 1386.1-2001

– Use PCI with a data bus of 32 or 64 bits

– XMC is becoming more common

• XMC – ANSI/VITA 42.0 (Approved 2008)

• PMC and XMC are based on the IEEE 1386-2001 Common Mezzanine Card (CMC)

– Single-width is 149.0 x 74.0 mm (5.87 x 2.91 inches); double-width is 149.0 x 149.0 mm

• There are dot specifications for various protocol options

– ANSI/VITA 42.1-2006 for Parallel RapidIO 8/16 LP-LVDS

– ANSI/VITA 42.2-2006 for Serial RapidIO

– ANSI/VITA 42.3-2006 for PCI Express (PCIe), the most popular for current systems

OpenVPX Modules are frequently adapted for particular I/O needs using mezzanine cards – FMCs used on FPGA boards to add things like: 

• Analog to digital and digital to analog converters 

• Fiber-optic transceivers.


High Frequency Trading

           

 

High Frequency Trading (HFT) has become an increasingly important element of financial markets over the past years. HFT describes a set of techniques within electronic trading of stocks and derivatives, where a large number of orders are injected into the market at sub-millisecond round-trip execution times. High frequency traders aim to end the trading day “flat”, without holding any significant positions, and utilize several strategies to generate revenue by buying and selling stock at very high speed. In fact, studies show that a high frequency trader holds stock for only 22 seconds on average. According to the Aite Group, the impact of HFT on the financial markets is substantial, accounting for more than 50% of all trades in 2010 on the US equity market, with a growth rate of 70% in 2009. High frequency traders utilize a number of different strategies, including liquidity-providing strategies, statistical arbitrage strategies and liquidity detection strategies.

 

 In liquidity-providing strategies, high frequency traders try to earn the bid-ask spread, which represents the difference between what buyers are willing to pay and what sellers are willing to accept for trading stock. High volatility and large bid-ask spreads can be turned into profits for the high frequency trader, while in return the trader provides liquidity to the market and lowers the bid-ask spread for other participants, adopting the role of a market maker. Liquidity and low bid-ask spreads are desirable as they reduce trading costs and improve the informational efficiency of asset prices. Traders that employ arbitrage strategies, on the other hand, try to correlate pricing information between related stocks or derivatives and their underlying prices.

 

Liquidity detection comprises strategies that send out small orders to discover large orders, which the trader can then take advantage of. All strategies have in common that they require the lowest possible round-trip latencies, as only the fastest HFT firm will be able to benefit from an existing opportunity. Electronic trading of stocks is conducted by sending orders in electronic form to a stock exchange. Bid and ask orders are then matched by the exchange to execute a trade. Outstanding orders are made visible to the market participants through so-called feeds. A feed is a compressed or uncompressed real-time data stream provided by an independent institution such as the Options Price Reporting Authority (OPRA). A feed carries pricing information of stocks and is multicast to the market participants using standardized protocols, which are generally transmitted over UDP over Ethernet. The standard protocol applied is FAST (FIX Adapted for STreaming), an adaptation of the Financial Information Exchange (FIX) protocol, which is used by multiple stock exchanges to distribute their market data.

 

A typical HFT system consists of four main building blocks: network stack, financial protocol parsing, order book handling and custom application layer. Financial exchanges broadcast market updates along an Ethernet connection at typical line rates of 10 Gb/s. The network stack receives the messages sent by the financial exchange and performs the initial packet processing.

The packets are usually compressed in a domain-specific format to save on bandwidth; a prominent example is FAST (FIX Adapted for STreaming), which is an adaptation of FIX (Financial Information Exchange).
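To give a feel for how FAST saves bandwidth, the sketch below shows stop-bit encoding of unsigned integers, one of the basic building blocks of FAST: each byte carries 7 data bits, and the most significant bit is set only on the last byte of a field. The function names are made up for this example, and this is only the integer primitive, not a full FAST decoder (presence maps, templates and field operators are left out).

```python
def encode_stop_bit_uint(value):
    """Encode a non-negative integer as a stop-bit encoded byte string."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    # Split the value into 7-bit groups, least significant group first.
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()      # most significant group goes on the wire first
    groups[-1] |= 0x80    # the stop bit marks the final byte of the field
    return bytes(groups)


def decode_stop_bit_uint(data, offset=0):
    """Decode one stop-bit encoded integer starting at `offset`.

    Returns (value, next_offset) so the caller can continue with the next field.
    """
    value = 0
    pos = offset
    while pos < len(data):
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:   # stop bit set: this was the last byte
            return value, pos
    raise ValueError("truncated field: no stop bit found")
```

With this scheme the value 942755 fits in three bytes (0x39 0x45 0xA3) instead of a fixed four-byte or eight-byte integer, and small values take a single byte. In hardware, this kind of byte-by-byte field extraction is the type of work the parsing block has to do at line rate.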

The financial protocol parsing block converts the compressed packets into meaningful limit and market orders that are used to build the order book. The order book gives a view of the current market price by ordering bids (buying offers) and asks (selling offers) according to their prices, with the highest bidding price and lowest asking price at the top of the book. Finally, the top bid and ask entries are used by the custom application layer to analyze the market and consequently issue buy/sell orders. These orders are then encoded using the same financial protocol and sent back over the network. The time interval between receiving the incoming packets of an order into the system and sending out the triggered response packets is defined as the time-to-trade or the round-trip latency. Due to the importance of having low-latency HFT systems, traders and hardware vendors have been in an arms race to lower the total round-trip latency.
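The order book handling described above can be sketched in a few lines of Python. This is a simplified software model for illustration only: the class name, method names and the price-level dictionaries are assumptions of this example, and a real FPGA implementation would use fixed-size hardware structures instead.

```python
class OrderBook:
    """Price-level order book: total quantity per price on each side."""

    def __init__(self):
        self.bids = {}  # price -> total quantity of buying offers
        self.asks = {}  # price -> total quantity of selling offers

    def _side(self, side):
        if side == "bid":
            return self.bids
        if side == "ask":
            return self.asks
        raise ValueError("side must be 'bid' or 'ask'")

    def add(self, side, price, quantity):
        """Add quantity at a price level, creating the level if needed."""
        book = self._side(side)
        book[price] = book.get(price, 0) + quantity

    def remove(self, side, price, quantity):
        """Remove quantity from a price level; drop the level when empty."""
        book = self._side(side)
        remaining = book.get(price, 0) - quantity
        if remaining > 0:
            book[price] = remaining
        else:
            book.pop(price, None)

    def top(self):
        """Return (highest bid price, lowest ask price); None for an empty side."""
        best_bid = max(self.bids) if self.bids else None
        best_ask = min(self.asks) if self.asks else None
        return best_bid, best_ask
```

The application layer would call `top()` after each update and decide whether to issue an order. Scanning with `max` and `min` keeps the sketch short; a latency-sensitive design would keep the best price of each side readily available instead of searching for it.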

Typical high-end processor-based systems with specialized Network Interface Controllers (NICs) can react to market orders in a few microseconds. But due to the need to decrease latency further, application-specific hardware accelerators started to gain more attention in the HFT domain, especially FPGA-based accelerators, due to their flexibility and re-programmability. FPGA-based systems proved to achieve far lower latency, approaching a four-fold reduction compared to conventional NIC solutions, often with more deterministic response times.